BIG DATA
                                  @nuboat
                         Peerapat Asoktummarungsri




Who am I?

                         Software Engineer @ FICO

                         Blogger @ Thailand Java User Group

                         Failed Startuper

                         Cat Lover

                         A little troll on Twitter



What is Big Data?


                         Big data analytics is concerned with the analysis of large volumes
                         of transaction/event data and behavioral analysis of human/human
                         and human/system interactions. (Gartner)

                         Big data represents the collection of technologies that handle large
                         data volumes well beyond that inflection point and for which, at
                         least in theory, hardware resources required to manage data by
                         volume track to an almost straight line rather than a curve. (IDC)




Structured & Unstructured
                            Level               Example

                         Structured             RDBMS

                    Semi-structured             XML, JSON

                   Quasi-structured             Text Document

                       Unstructured             Image, Video


      Unstructured data generated around the world in one minute
      [Infographic omitted; source: go-globe.com]
The challenges associated with Big Data

                     [Graphic omitted; source: oracle.com]




Use the Right Tool


                         RDBMS                          HADOOP
              Interactive Reporting (<1s)     Affordable Storage/Compute

              Multistep Transactions          Structured or Not (Agility)

              Lots of Insert/Update/Delete    Resilient Auto Scalability


   [Diagram: the Hadoop stack. HDFS and HBase form the storage layer; MapReduce, YARN, and
   Common sit above them; Hive, Sqoop, Pig, Cassandra, Mahout, etc. run on top; ZooKeeper and
   Flume sit alongside the stack.]
HDFS: Hadoop Distributed File System

   [Diagram: a file split into blocks 1-5 and spread across Data Nodes A-E with Replication
   Factor = 3, so each block is stored on three different nodes.]
HDFS: Hadoop Distributed File System

   [Diagram: the same cluster after one Data Node fails; every block is still available from
   the replicas on the surviving nodes.]
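Not on the original slides, but two commands make this replication behaviour visible on a running cluster: the first raises a file's replication factor to 3 and waits for it to take effect, the second lists each block and the Data Nodes holding it (the path is illustrative).

   $ hadoop fs -setrep -w 3 /input/file.txt
   $ hadoop fsck /input/file.txt -files -blocks -locations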
Master/Slave mode only

   [Diagram: a Client talks to the Name Node (the HDFS master) and the Job Tracker (the
   MapReduce master); both masters coordinate the pool of Data Nodes beneath them.]
Install & Config
                         Install JDK
                         Adding Hadoop System User
                         Config SSH & Passwordless
                         Disabling IPv6
                         Installing Hadoop, chown, .sh, conf.xml
                         Formatting HDFS
                         Start Hadoop
                         Hadoop Web Console
                         Stop Hadoop

1.  Install Java
     [root@localhost ~]# ./jdk-6u39-linux-i586.bin
     [root@localhost ~]# chown -R root:root ./jdk1.6.0_39/
     [root@localhost ~]# mv jdk1.6.0_39/ /usr/lib/jvm/
     [root@localhost ~]# update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.6.0_39/bin/java"
     1
     [root@localhost ~]# update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.6.0_39/bin/
     javac" 1
     [root@localhost ~]# update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.6.0_39/bin/
     javaws" 1
     [root@localhost ~]# update-alternatives --config java
     There are 3 programs which provide 'java'.

       Selection    Command
     -----------------------------------------------
     *  1           /usr/lib/jvm/jre-1.6.0-openjdk/bin/java
        2           /usr/lib/jvm/jre-1.5.0-gcj/bin/java
     + 3           /usr/lib/jvm/jdk1.6.0_39/bin/java

     Enter to keep the current selection[+], or type selection number: 3

     [root@localhost ~]# java -version
     java version "1.6.0_39"
     Java(TM) SE Runtime Environment (build 1.6.0_39-b04)
     Java HotSpot(TM) Server VM (build 20.14-b01, mixed mode)



2.  Create hadoop user
                         [root@localhost ~]# groupadd hadoop
                         [root@localhost ~]# useradd -g hadoop -d /home/hdadmin hdadmin
                         [root@localhost ~]# passwd hdadmin

                         3.  Config SSH
                         [root@localhost ~]# service sshd start
                         ...
                         [root@localhost ~]# chkconfig sshd on
                         [root@localhost ~]# su - hdadmin
                         $ ssh-keygen -t rsa -P ""
                         $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
                         $ ssh localhost

                         4. Disable IPv6
                         [root@localhost ~]# vi /etc/sysctl.conf

                         # disable ipv6
                         net.ipv6.conf.all.disable_ipv6 = 1
                         net.ipv6.conf.default.disable_ipv6 = 1
                         net.ipv6.conf.lo.disable_ipv6 = 1




5. SET Path
                         $ vi ~/.bashrc

                         # .bashrc

                         # Source global definitions
                         if [ -f /etc/bashrc ]; then
                                 . /etc/bashrc
                         fi

                         # Do not set HADOOP_HOME
                         # User specific aliases and functions
                         export HIVE_HOME=/usr/local/hive-0.9.0
                         export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_39
                         export PATH=$PATH:/usr/local/hadoop-0.20.205.0/bin:$JAVA_HOME/bin:
                         $HIVE_HOME/bin




6. Install Hadoop
                         [root@localhost ~]# tar xzf hadoop-0.20.205.0-bin.tar.gz
                         [root@localhost ~]# mv hadoop-0.20.205.0 /usr/local/hadoop-0.20.205.0
                         [root@localhost ~]# cd /usr/local
                         [root@localhost ~]# chown -R hdadmin:hadoop hadoop-0.20.205.0

                         [root@localhost ~]# mkdir -p /data/hadoop
                         [root@localhost ~]# chown hdadmin:hadoop /data/hadoop
                         [root@localhost ~]# chmod 750 /data/hadoop




7. Config HADOOP
                $ vi /usr/local/hadoop-0.20.205.0/conf/hadoop-env.sh

                # export JAVA_HOME=/usr/lib/j2sdk1.5-sun
                export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_39
                export HADOOP_HOME_WARN_SUPPRESS="TRUE"

                $ vi /usr/local/hadoop-0.20.205.0/conf/core-site.xml

                <configuration>
                <property>
                  <name>hadoop.tmp.dir</name>
                  <value>/data/hadoop</value>
                  <description>A base for other temporary directories.</description>
                </property>
                <property>
                  <name>fs.default.name</name>
                  <value>hdfs://localhost:54310</value>
                  <description>
                The name of the default file system. A URI whose scheme and authority determine the
                FileSystem implementation.
                The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem
                implementation class.
                The uri's authority is used to determine the host, port, etc. for a filesystem.
                </description>
                </property>
                </configuration>


$ vi /usr/local/hadoop-0.20.205.0/conf/hdfs-site.xml

       <configuration>
       <property>
         <name>dfs.replication</name>
         <value>1</value>
         <description>
           Default block replication. The actual number of replications can be specified when the file is created.
           The default is used if replication is not specified in create time.
         </description>
       </property>
       </configuration>

       $ vi /usr/local/hadoop-0.20.205.0/conf/mapred-site.xml

       <configuration>
       <property>
         <name>mapred.job.tracker</name>
         <value>localhost:54311</value>
         <description>
           The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a
           single map and reduce task.</description>
       </property>
       </configuration>




8. Format HDFS
$ hadoop namenode -format
13/04/03 13:59:54 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.205.0
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205
-r 1179940; compiled by 'hortonfo' on Fri Oct  7 06:26:14 UTC 2011
************************************************************/
13/04/03 14:00:02 INFO util.GSet: VM type       = 32-bit
13/04/03 14:00:02 INFO util.GSet: 2% max memory = 17.77875 MB
13/04/03 14:00:02 INFO util.GSet: capacity      = 2^22 = 4194304 entries
13/04/03 14:00:02 INFO util.GSet: recommended=4194304, actual=4194304
13/04/03 14:00:02 INFO namenode.FSNamesystem: fsOwner=hdadmin
13/04/03 14:00:02 INFO namenode.FSNamesystem: supergroup=supergroup
13/04/03 14:00:02 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/04/03 14:00:02 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/04/03 14:00:02 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0
min(s), accessTokenLifetime=0 min(s)
13/04/03 14:00:02 INFO namenode.NameNode: Caching file names occuring more than 10 times 
13/04/03 14:00:02 INFO common.Storage: Image file of size 113 saved in 0 seconds.
13/04/03 14:00:02 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully
formatted.
13/04/03 14:00:02 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
************************************************************/

9. Start NameNode, DataNode, JobTracker and TaskTracker
      $ cd /usr/local/hadoop/bin/
      $ start-all.sh

      starting namenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-namenode-
      localhost.localdomain.out
      localhost: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-datanode-
      localhost.localdomain.out
      localhost: starting secondarynamenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-
      secondarynamenode-localhost.localdomain.out
      starting jobtracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-jobtracker-
      localhost.localdomain.out
      localhost: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hdadmin-tasktracker-
      localhost.localdomain.out

      $ jps

      27410       SecondaryNameNode
      27643       TaskTracker
      27758       Jps
      27504       JobTracker
      27110       NameNode
      27259       DataNode




10. Browse HDFS via Browser
                   http://localhost:50070/

                   NameNode 'localhost.localdomain:54310'
                   Started:      Wed Apr 03 14:11:39 ICT 2013
                   Version:      0.20.205.0, r1179940
                   Compiled:      Fri Oct 7 06:26:14 UTC 2011 by hortonfo
                   Upgrades:      There are no upgrades in progress.

                   Browse the filesystem
                   Namenode Logs
                   Cluster Summary
                   6 files and directories, 1 blocks = 7 total. Heap Size is 58.88 MB / 888.94 MB (6%)
                   Configured Capacity     :     15.29 GB
                   DFS Used     :     28.01 KB
                   Non DFS Used     :     4.94 GB
                   DFS Remaining     :     10.36 GB
                   DFS Used%     :     0 %
                   DFS Remaining%     :     67.72 %
                   Live Nodes      :     1
                   Dead Nodes      :     0
                   Decommissioning Nodes      :     0
                   Number of Under-Replicated Blocks     :     0




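The outline slide also lists "Stop Hadoop", which the deck never shows. With this layout it is simply the companion script, run as hdadmin:

       $ cd /usr/local/hadoop-0.20.205.0/bin/
       $ stop-all.sh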
Map Reduce
                Map:      (k1, v1) -> list(k2, v2)

                Reduce: (k2, List(v2)) -> list(k3, v3)




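As a concrete word-count example: the map step turns (offset, "the cat sat on the mat") into [(the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1)], and the reduce step turns (the, [1, 1]) into (the, 2). The sketches after the next slides spell this out in code.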
Map         Reduce




Map




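The slide here was a code screenshot that did not survive the export. As a stand-in, a minimal word-count mapper against the Hadoop 0.20 org.apache.hadoop.mapreduce API might look like this; the package and class names are assumptions chosen to match the com.fico.Wordcount jar on the Execute slide, not the author's original code.

      package com.fico;

      import java.io.IOException;
      import java.util.StringTokenizer;

      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      // Map: (k1, v1) -> list(k2, v2)
      // Here: (byte offset, line of text) -> a (word, 1) pair for every token in the line.
      public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer tokens = new StringTokenizer(value.toString());
              while (tokens.hasMoreTokens()) {
                  word.set(tokens.nextToken());
                  context.write(word, ONE);     // emit (word, 1)
              }
          }
      }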
Reduce




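This slide was also an image. A matching reducer sketch (same hypothetical package) simply sums the 1s emitted by the mapper:

      package com.fico;

      import java.io.IOException;

      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;

      // Reduce: (k2, List(v2)) -> list(k3, v3)
      // Here: (word, [1, 1, ...]) -> (word, count).
      public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

          @Override
          protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable v : values) {
                  sum += v.get();               // add up the 1s collected for this word
              }
              context.write(key, new IntWritable(sum));
          }
      }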
Job




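The job slide was an image as well. A driver that wires the two classes together and matches the hadoop jar invocation on the next slide could look like this; again a sketch, not the original code:

      package com.fico;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class Wordcount {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              Job job = new Job(conf, "wordcount");          // the 0.20-era constructor; Job.getInstance() came later
              job.setJarByClass(Wordcount.class);
              job.setMapperClass(WordcountMapper.class);
              job.setCombinerClass(WordcountReducer.class);  // summing is associative, so the reducer doubles as a combiner
              job.setReducerClass(WordcountReducer.class);
              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(IntWritable.class);
              FileInputFormat.addInputPath(job, new Path(args[0]));      // e.g. /input/*
              FileOutputFormat.setOutputPath(job, new Path(args[1]));    // e.g. /output/wordcount_output_dir
              System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }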
Execute


   $ hadoop jar ./wordcount.jar com.fico.Wordcount /input/* /output/wordcount_output_dir




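For the command above to succeed, the input directory has to exist in HDFS and the output directory must not exist yet. A minimal end-to-end run (local file names are illustrative; part-r-00000 assumes a new-API reducer, the old API writes part-00000) looks like:

   $ hadoop fs -mkdir /input
   $ hadoop fs -put ./sample/*.txt /input/
   $ hadoop jar ./wordcount.jar com.fico.Wordcount /input/* /output/wordcount_output_dir
   $ hadoop fs -cat /output/wordcount_output_dir/part-r-00000 | head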
Bonus




HIVE

Hive is a data warehouse system for Hadoop that facilitates easy data
summarization, ad-hoc queries, and the analysis of large datasets stored in
Hadoop compatible file systems. Hive provides a mechanism to project structure
onto this data and query the data using a SQL-like language called HiveQL. At the
same time this language also allows traditional map/reduce programmers to plug
in their custom mappers and reducers when it is inconvenient or inefficient to
express this logic in HiveQL.




HIVE Command Ex.
             hive> CREATE TABLE pokes (foo INT, bar STRING);

             hive> SHOW TABLES;

             hive> DESCRIBE invites;

             hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);

             hive> DROP TABLE pokes;

             hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt'
             OVERWRITE INTO TABLE pokes;

             hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';




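Not from the original deck, but one way to tie Hive back to the MapReduce result above is an external table over the wordcount output. The path and column names are assumptions; the default TextOutputFormat separates key and value with a tab:

             hive> CREATE EXTERNAL TABLE wordcount (word STRING, total INT)
                   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
                   LOCATION '/output/wordcount_output_dir';

             hive> SELECT word, total FROM wordcount ORDER BY total DESC LIMIT 10;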
SQOOP

             With Sqoop, you can import data from a relational database system
             into HDFS. The input to the import process is a database table. Sqoop
             will read the table row-by-row into HDFS. The output of this import
             process is a set of files containing a copy of the imported table. The
             import process is performed in parallel. For this reason, the output will
             be in multiple files. These files may be delimited text files (for
             example, with commas or tabs separating each field), or binary Avro
             or SequenceFiles containing serialized record data.




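A typical invocation (not on the slide; the JDBC URL, credentials, table, and target directory are placeholders) looks like:

             $ sqoop import --connect jdbc:mysql://localhost/salesdb \
                 --username hdadmin -P \
                 --table customers \
                 --target-dir /input/customers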
THANK YOU :)



(c) copyright 2013 nuboat in wonderland
