Ruby on Hadoop
Tuesday, January 8, 13
Introduction




                                      Hi.
                                   I’m Ted O’Meara
                         ...and I just quit my job last week.

                                    @tomeara
                                 tedomeara.com

MapReduce
History of MapReduce



        • First implemented
          by Google
        • Used in CouchDB,
          Hadoop, etc.
        • Helps to “distill” data into
          a concentrated result set




What is MapReduce?




   input = ["deer", "bear", "river", "car", "car", "river",
            "deer", "car", "bear"]

   input.map! { |x| [x, 1] }

   sum = 0
   input.each do |x|
     sum += x[1]
   end
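
The snippet above adds every pair into a single total. A fuller word count groups the pairs by key between the map and reduce phases. Here is a minimal plain-Ruby sketch of that flow, assuming the same input list; the variable names and the explicit "shuffle" step are illustrative and not taken from the slides.

```ruby
# Toy word count in plain Ruby: map, shuffle (group by key), reduce.
# Names are illustrative; on a cluster Hadoop performs the shuffle itself.
input = ["deer", "bear", "river", "car", "car", "river",
         "deer", "car", "bear"]

# Map: emit a [key, 1] pair for every word.
mapped = input.map { |word| [word, 1] }

# Shuffle: collect all pairs that share a key.
shuffled = mapped.group_by { |word, _| word }

# Reduce: sum the values for each key.
counts = shuffled.map { |word, pairs| [word, pairs.sum { |_, n| n }] }.to_h

counts.each { |word, n| puts "#{word}\t#{n}" }
```

Grouping before reducing is what lets each key be summed independently, which is the property a cluster exploits to run reducers in parallel.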




Hadoop Breakdown
History of Hadoop



        • Doug Cutting @ Yahoo!
        • Named after a toy elephant
        • A framework for distributed computing
        • Also a distributed filesystem (HDFS)




Network Topology


Hadoop Cluster

                         Cluster
                         •Commodity hardware
                         •Partition tolerant
                         •Network-aware (rack-aware)



                          555.555.1.*             555.555.2.*              444.444.1.*
                              JobTracker              NameNode              TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode




Hadoop Cluster

                         NameNode
                         •Keeps track of the DataNodes
                         •Uses a “heartbeat” to determine each node’s health
                         •Spend the most resources here
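
The heartbeat idea can be sketched in a few lines of plain Ruby. This is a toy model under stated assumptions: the class name, the node names, and the 10-second timeout are hypothetical, and none of it is Hadoop's actual API.

```ruby
# Toy heartbeat monitor. Class name and timeout are hypothetical,
# not part of Hadoop.
class HeartbeatMonitor
  def initialize(timeout_seconds)
    @timeout = timeout_seconds
    @last_seen = {}
  end

  # Record that a node reported in at the given time (in seconds).
  def beat(node, at)
    @last_seen[node] = at
  end

  # Nodes whose last heartbeat is older than the timeout are presumed dead.
  def dead_nodes(now)
    @last_seen.select { |_, seen| now - seen > @timeout }.keys
  end
end

monitor = HeartbeatMonitor.new(10)
monitor.beat("datanode-1", 0)
monitor.beat("datanode-2", 0)
monitor.beat("datanode-1", 8)   # datanode-1 checks in again, datanode-2 stays silent
p monitor.dead_nodes(15)
```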



                          555.555.1.*             555.555.2.*                 444.444.1.*
                              JobTracker              NameNode                 TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode   ♥    TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode        TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode        TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode        TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode        TaskTracker/DataNode




Hadoop Cluster

                         DataNode
                         •Stores filesystem blocks
                         •Can be scaled: nodes can be spun up or down
                         •Blocks are replicated according to a configured replication factor
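
To make the replication factor concrete, here is a toy placement sketch in plain Ruby. It assumes sizes in megabytes and uses simple round-robin placement, which is a deliberate simplification and not HDFS's rack-aware policy; the function and node names are hypothetical.

```ruby
# Toy block placement. Round-robin is a simplification of what HDFS does;
# function and node names are hypothetical.
def place_blocks(file_size, block_size, nodes, replication)
  block_count = (file_size.to_f / block_size).ceil
  (0...block_count).map do |block|
    # Pick `replication` distinct nodes, rotating the starting node per block.
    nodes.rotate(block).first(replication)
  end
end

nodes = %w[dn1 dn2 dn3 dn4]
placement = place_blocks(300, 128, nodes, 3)   # 300 MB file, 128 MB blocks
placement.each_with_index do |replicas, i|
  puts "block #{i}: #{replicas.join(', ')}"
end
```

With a replication factor of 3, any single node in this sketch can disappear and every block still has two copies left.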



Hadoop Cluster

                         JobTracker
                         •Decides which TaskTrackers handle the tasks of a
                          MapReduce job
                         •Communicates with the NameNode to assign a TaskTracker
                          close to the DataNode where the source data lives
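
The "close to the data" preference can be sketched as a three-step fallback. This is an illustration only: the hash layout and the host and rack names are hypothetical, and the real scheduler is considerably more involved.

```ruby
# Toy locality-aware assignment. The hash layout and host/rack names are
# hypothetical and much simpler than Hadoop's real scheduler.
def pick_tracker(replicas, idle_trackers)
  # 1. Best: a tracker on a host that already stores the block.
  node_local = idle_trackers.find do |t|
    replicas.any? { |r| r[:host] == t[:host] }
  end
  return node_local if node_local

  # 2. Next best: a tracker in the same rack as some replica.
  rack_local = idle_trackers.find do |t|
    replicas.any? { |r| r[:rack] == t[:rack] }
  end

  # 3. Otherwise fall back to any idle tracker.
  rack_local || idle_trackers.first
end

replicas = [{ host: "dn2", rack: "rack-b" }, { host: "dn7", rack: "rack-c" }]
idle     = [{ host: "dn5", rack: "rack-a" }, { host: "dn3", rack: "rack-b" }]

chosen = pick_tracker(replicas, idle)
puts chosen[:host]
```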


                          555.555.1.*                 555.555.2.*              444.444.1.*
                              JobTracker                  NameNode              TaskTracker/DataNode

                           TaskTracker/DataNode   ♥    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode        TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode        TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode        TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode        TaskTracker/DataNode     TaskTracker/DataNode




Hadoop Cluster

                         TaskTracker
                         •Worker for MapReduce jobs
                         •The closer to the DataNode with the data, the better



HDFS



                                           hadoop fs -put localfile /user/hadoop/hadoopfile




Hadoop Streaming


        $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
                          -input "/user/me/samples/cachefile/input.txt" \
                          -mapper "xargs cat" \
                          -reducer "cat" \
                          -output "/user/me/samples/cachefile/out" \
                          -cacheArchive 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar#testlink' \
                          -jobconf mapred.map.tasks=3 \
                          -jobconf mapred.reduce.tasks=3 \
                          -jobconf mapred.job.name="Experiment"
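
Streaming mappers and reducers are ordinary programs that read lines on STDIN and write tab-separated key/value lines on STDOUT. Below is a minimal Ruby sketch of a word-count pair, written against generic IO objects so it can be tried locally. The method names are hypothetical, and the in-memory sort only stands in for Hadoop's shuffle.

```ruby
require "stringio"

# Mapper: write one "word\t1" line for every word read from `input`.
def map_stream(input, output)
  input.each_line do |line|
    line.split.each { |word| output.puts "#{word}\t1" }
  end
end

# Reducer: input arrives sorted by key, so equal keys are adjacent.
def reduce_stream(input, output)
  current = nil
  count = 0
  input.each_line do |line|
    key, value = line.chomp.split("\t")
    if key != current
      output.puts "#{current}\t#{count}" if current
      current = key
      count = 0
    end
    count += value.to_i
  end
  output.puts "#{current}\t#{count}" if current
end

# Local dry run: map, sort (standing in for the shuffle), reduce.
mapped = StringIO.new
map_stream(StringIO.new("deer bear river\ncar car river\n"), mapped)
sorted = mapped.string.lines.sort.join
reduced = StringIO.new
reduce_stream(StringIO.new(sorted), reduced)
puts reduced.string
```

In a real job each method would live in its own script reading `$stdin` and writing `$stdout`, with the scripts passed to `-mapper` and `-reducer`.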




Hadoop Streaming




                          Pig        Hive          Wukong
                         Pig Latin   SQL-ish         Ruby!




           Hadoop Ecosystem
Wukong


        •Built by Infochimps
        •Currently under heavy
         development
        •Use the 3.0.0.pre3 gem
            https://github.com/infochimps-labs/wukong/tree/3.0.0

        •Model your jobs with
         wukong-hadoop
            https://github.com/infochimps-labs/wukong-hadoop




Wukong



            Wukong
            •Write mappers and reducers using Ruby
            •As of 3.0.0, Wukong uses “Processors”, which are Ruby
             classes that define map, reduce, and other tasks

            wukong-hadoop
            •A CLI to use with Hadoop
            •Created around building tasks with Wukong
            •Better than piping in the shell
             (you can see this with --dry_run)




Wukong Processors

        •Fields are accessible through switches in shell
        •Local hand-off is made at STDOUT to STDIN

                                     Wukong.processor(:mapper) do

                                       field :min_length, Integer,  :default => 1
                                       field :max_length, Integer,  :default => 256
                                       field :split_on,   Regexp,   :default => /\s+/
                                       field :remove,     Regexp,   :default => /[^a-zA-Z0-9']+/
                                       field :fold_case,  :boolean, :default => false

                                       # Emit every acceptable token found in the input line.
                                       def process string
                                         tokenize(string).each do |token|
                                           yield token if acceptable?(token)
                                         end
                                       end

                                       private

                                       # Split on whitespace, strip unwanted characters, optionally downcase.
                                       def tokenize string
                                         string.split(split_on).map do |token|
                                           stripped = token.gsub(remove, '')
                                           fold_case ? stripped.downcase : stripped
                                         end
                                       end

                                       def acceptable? token
                                         (min_length..max_length).include?(token.length)
                                       end
                                     end
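
To see what the mapper emits without installing Wukong, its tokenizing logic can be mirrored in plain Ruby. This is a stand-in sketch, not Wukong's API: the `tokens` method is hypothetical, and its keyword defaults copy the field defaults from the slide.

```ruby
# Plain-Ruby stand-in for the mapper's logic, with no Wukong dependency.
# The method name is hypothetical; keyword defaults mirror the slide's fields.
def tokens(string, min_length: 1, max_length: 256, split_on: /\s+/,
           remove: /[^a-zA-Z0-9']+/, fold_case: false)
  string.split(split_on)
        .map { |t| t.gsub(remove, "") }
        .map { |t| fold_case ? t.downcase : t }
        .select { |t| (min_length..max_length).include?(t.length) }
end

default_run = tokens("D'oh! Don't do that, dog.")
folded_run  = tokens("D'oh! Don't do that, dog.", fold_case: true, min_length: 4)

p default_run
p folded_run
```

The keyword arguments play the role of the shell switches mentioned on the slide: each one overrides a single field default.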




Wukong Processors



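                         Wukong.processor(:reducer, Wukong::Processor::Accumulator) do

                           attr_accessor :count

                           # Called when a new key begins.
                           def start record
                             self.count = 0
                           end

                           # Called for every record that shares the current key.
                           def accumulate record
                             self.count += 1
                           end

                           # Called when the key ends: emit "key<TAB>count".
                           def finalize
                             yield [key, count].join("\t")
                           end
                         end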




Wukong Processors

           wu-hadoop /home/hduser/wukong-hadoop/examples/word_count.rb \
                            --mode=local \
                            --input="/home/hduser/simpsons/simpsonssubs/Simpsons [1.08].sub"




                                      Simpsons - Ep 8
                                      do         7
                                      Doctor     1
                                      Does       2
                                      doesn't    1
                                      dog        2
                                      D'oh       1
                                      doif       1
                                      doing      2
                                      done       1
                                      doneYou    1
                                      don't      10
                                      Don't      1




The End




                         Thank you!
                             @tomeara
                             ted@tedomeara.com



