Hadoop Streaming and Pipes
July 10, 2012
Clay Jiang
Big Data Engineering Team
Hanborq Inc.
Hadoop Streaming
• Hadoop Streaming is a tool that runs an MR job with any executable program or script acting as the Map/Reduce

• $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar
First Streaming Run
• Basic command:
 – hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
   -input /path/to/inputdir \
   -output /path/to/outputdir \
   -mapper /path/to/map_exec \
   -reducer /path/to/reduce_exec
How Streaming Works?
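• The Mapper/Reducer starts map_exec/reduce_exec as a separate process
• The Mapper/Reducer exchanges <key,value> pairs with that process over stdin and stdout
• <key,value> pairs are passed to map_exec/reduce_exec in an agreed format; the default format is "key\tvalue"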
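The exchange described above can be tried without a cluster. The following is a minimal sketch, not Hadoop code: `CHILD_SOURCE` is a hypothetical stand-in for map_exec, and `run_through_child` is an illustrative name. It writes "key\tvalue" lines to a child process on stdin and parses the same format from its stdout.

```python
import subprocess
import sys

# Hypothetical stand-in for map_exec: upper-cases the value of each "key\tvalue" line.
CHILD_SOURCE = r"""
import sys
for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    sys.stdout.write(key + "\t" + value.upper() + "\n")
"""


def run_through_child(pairs):
    """Send (key, value) pairs to a child process and parse its tab-separated output."""
    payload = "".join(f"{key}\t{value}\n" for key, value in pairs)
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        input=payload,
        capture_output=True,
        text=True,
        check=True,
    )
    parsed = []
    for line in result.stdout.splitlines():
        key, _, value = line.partition("\t")
        parsed.append((key, value))
    return parsed
```

The child only ever sees text lines; it needs no Hadoop library, which is why any executable can play the role.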
Hadoop Streaming Example
• Streaming WordCount




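The original WordCount listing is not preserved in this transcript. As a hedged reconstruction of the idea, a streaming WordCount could look like the sketch below. The function names and `run_locally` are illustrative assumptions; in an actual job the mapper and reducer would each be a script reading `sys.stdin` and printing to `sys.stdout`.

```python
from itertools import groupby


def mapper(lines):
    """Emit "word\t1" for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))
```

`run_locally` mirrors the common shell check `cat input | mapper | sort | reducer`, which is a convenient way to test both scripts before submitting a job.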
Streaming Internal
• It is only a tool, not a new mechanism
• It adds an adapter layer on top of the existing MapReduce framework:
 – PipeMapper + PipeMapRunner
 – PipeCombiner
 – PipeReducer
 – No PipePartitioner
Streaming Internal
PipeMapper/PipeReducer are responsible for exchanging data with the executable over stdin/stdout
Streaming Internal
• Main entry point of hadoop-streaming*.jar:

• It dispatches to one of three tools:
Streaming-StreamJob
• StreamJob
  – parseArgv:
     • Argv → field members

  – setJobConf:
     • Field members → JobConf

  – submitAndMonitorJob:
     • Submits the JobConf through JobClient
Streaming Map
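• -mapper <cmd|JavaClassName>
• PipeMapRunner/PipeMapper
  – startOutputThreads: starts an MROutputThread that "tails" the stdout of map_exec, reads the output with an OutputReader, parses it, and writes it to the collector
  – PipeMapper.map: uses an InputWriter to serialize each key/value into a string that map_exec can parse, and writes it to the stdin of map_exec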
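The two roles above, one thread draining stdout while the caller keeps writing to stdin, can be sketched as follows. This is an analogy, not the Java implementation: `CHILD_SOURCE` is a hypothetical map_exec, and `pipe_map` and `drain_stdout` are illustrative names. Reading on a separate thread matters because a child that blocks on a full stdout pipe would otherwise stall the writer.

```python
import subprocess
import sys
import threading

# Hypothetical map_exec stand-in: emits each input line with its length as the value.
CHILD_SOURCE = r"""
import sys
for line in sys.stdin:
    text = line.rstrip("\n")
    sys.stdout.write(text + "\t" + str(len(text)) + "\n")
    sys.stdout.flush()
"""


def pipe_map(records):
    """Feed records to the child on stdin while a thread collects its stdout."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    collected = []  # plays the role of the output collector

    def drain_stdout():
        # Analogous to MROutputThread: read lines as they appear and parse them.
        for line in proc.stdout:
            key, _, value = line.rstrip("\n").partition("\t")
            collected.append((key, value))

    reader = threading.Thread(target=drain_stdout)
    reader.start()  # started before any input is written
    for record in records:  # analogous to map() being called once per record
        proc.stdin.write(record + "\n")
    proc.stdin.close()  # end of input lets the child exit
    reader.join()
    proc.stdout.close()
    proc.wait()
    return collected
```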
Streaming Reduce
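• -reducer <cmd|JavaClassName>
• PipeReducer
  – Relies on the built-in MapReduce shuffle to deliver data to the reducer
  – startOutputThreads: on the first reduce call, likewise starts an MROutputThread to collect the stdout of the reducer command
  – Likewise uses an InputWriter to translate the reduce key/values and feeds them to the reducer command one pair at a time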
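Feeding the reducer command "one pair at a time" means each grouped reduce call is flattened into plain lines. A small sketch of that flattening, with hypothetical function names and the default tab separator assumed:

```python
def reduce_input_lines(key, values, separator="\t"):
    """Turn one reduce call (key, values) into one text line per pair."""
    for value in values:
        yield f"{key}{separator}{value}"


def feed_reducer(grouped):
    """Flatten (key, values) groups into the line stream a reducer command would read."""
    lines = []
    for key, values in grouped:
        lines.extend(reduce_input_lines(key, values))
    return lines
```

A consequence is that the reducer command never receives a grouped iterator. It sees consecutive lines that share a key and has to detect the key change itself.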
InputWriter/OutputReader
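• InputWriter
  – Writes <key,value> to the executable's stdin in a predefined encoding

• OutputReader
  – Reads the executable's stdout and decodes it into <key,value>

• InputWriter + OutputReader
  – Together they form the data-transfer protocol between the Java process and the map/reduce executable process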
TextInputWriter/TextOutputReader

• Used by default:
  – TextInputWriter, TextOutputReader
• <key,value> → key + separator + value
• Default separator: \t
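The text encoding is simple enough to state in a few lines. The sketch below uses hypothetical function names (`write_text`, `read_text`) and only mirrors the format, not the Java classes. It assumes that a line without a separator is treated as a key with an empty value.

```python
SEPARATOR = "\t"  # the default separator


def write_text(key, value, separator=SEPARATOR):
    """InputWriter side: encode one pair as a line for the executable's stdin."""
    return f"{key}{separator}{value}\n"


def read_text(line, separator=SEPARATOR):
    """OutputReader side: decode one stdout line back into (key, value)."""
    # Split at the first separator only; later separators stay in the value.
    # Assumption: a line with no separator yields an empty value.
    key, _, value = line.rstrip("\n").partition(separator)
    return key, value
```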
Streaming Data Flow




Streaming Combiner
• -combiner <cmd|JavaClassName>

• PipeCombiner simply extends PipeReducer; its flow is the same as PipeReducer's
Streaming Partitioner
• -partitioner <javaClassName>
• For now, the partitioner must be a Java class
Streaming I/O Format
• -inputformat <javaClassName>
  – JobConf.setInputFormat()

• -outputformat <javaClassName>
  – JobConf.setOutputFormat()

• -inputreader <javaClassName>:
  – Uses StreamInputFormat as the InputFormat
Streaming IO Spec
• TextInputWriter/TextOutputReader:
  – stream.map/reduce.output.field.separator
     • Separator used in the output of the map/reduce executable
  – stream.map/reduce.input.field.separator
     • Separator used in the input to the map/reduce executable
  – stream.num.map/reduce.output.key.fields
     • The separator splits a line into fields; this sets how many leading fields form the key
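How the separator and the key-field count interact is easiest to see on a concrete line. The sketch below is an illustration with a hypothetical function name, not the Hadoop implementation. It assumes that a line with too few separators becomes a key with an empty value.

```python
def split_key_value(line, separator="\t", num_key_fields=1):
    """Split a line into (key, value) after the first num_key_fields fields."""
    fields = line.rstrip("\n").split(separator)
    if len(fields) <= num_key_fields:
        # Too few separators: the whole line is the key, the value is empty.
        return separator.join(fields), ""
    key = separator.join(fields[:num_key_fields])
    value = separator.join(fields[num_key_fields:])
    return key, value
```

For example, with separator "." and two key fields, the line "a.b.c.d" splits into key "a.b" and value "c.d"; the separator between key and value is dropped.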
Streaming IO Spec
• -io text|rawbytes|typedbytes
  – text → TextInputWriter/TextOutputReader
  – rawbytes → RawBytesInputWriter/RawBytesOutputReader
  – typedbytes → TypedBytesInputWriter/TypedBytesOutputReader
  – The option is resolved by IdentifierResolver
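Conceptually, the resolver is a lookup from the -io identifier to a writer/reader pair. The sketch below is a simplified stand-in in Python rather than the Java class. The case-insensitive match and the fallback to "text" for unknown identifiers are assumptions made for illustration.

```python
# Class names follow the Hadoop Streaming naming for each identifier;
# the lookup itself is a simplified stand-in for IdentifierResolver.
IO_CLASSES = {
    "text": ("TextInputWriter", "TextOutputReader"),
    "rawbytes": ("RawBytesInputWriter", "RawBytesOutputReader"),
    "typedbytes": ("TypedBytesInputWriter", "TypedBytesOutputReader"),
}


def resolve(identifier):
    """Return (input_writer, output_reader) class names for an -io identifier."""
    # Assumption: unknown identifiers fall back to the text pair.
    return IO_CLASSES.get(identifier.lower(), IO_CLASSES["text"])
```

A user-defined resolver would extend this table with its own identifier and class pair.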
User-Defined IO Spec
• MyInputWriter/MyOutputReader
  – extend InputWriter/OutputReader
• MyIdentifierResolver
  – extends IdentifierResolver
  – resolves the identifier "my" → MyInputWriter/MyOutputReader
  – -Dstream.io.identifier.resolver.class=MyIdentifierResolver
Debug Streaming
• -mapdebug/-reducedebug
  – Runs a debug script when a map/reduce task fails
  – $script $stdout $stderr $syslog $jobconf
• -debug
  – Does not delete /tmp/${user.name}/streamjob.jar when the job finishes
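A debug script only has to accept the four paths in the order shown. The sketch below is one possible script, not something shipped with Hadoop: the names `tail` and `main` and the choice to print the tail of stderr and syslog are illustrative assumptions.

```python
import sys


def tail(path, max_lines=20):
    """Return the last max_lines lines of a text file, or a note if unreadable."""
    try:
        with open(path, encoding="utf-8", errors="replace") as handle:
            return handle.readlines()[-max_lines:]
    except OSError as exc:
        return [f"<cannot read {path}: {exc}>\n"]


def main(argv):
    """Expect the four paths in order: stdout, stderr, syslog, jobconf."""
    if len(argv) != 4:
        print("usage: debug_script.py <stdout> <stderr> <syslog> <jobconf>")
        return 2
    stdout_path, stderr_path, syslog_path, jobconf_path = argv
    # Only stderr and syslog are summarized here; the other paths are accepted but unused.
    for label, path in (("stderr", stderr_path), ("syslog", syslog_path)):
        print(f"==== last lines of {label} ({path}) ====")
        sys.stdout.writelines(tail(path))
    return 0
```

When saved as a script, its entry point would call `main(sys.argv[1:])`.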
V.S. Hadoop Pipes
• Stdin/stdout  Socket

• 限定I/O接口 
 $HADOOP_HOME/c++/$PLATFORM/include
  – HadoopPipes::Mapper::map(MapContext& context)

  – HadoopPipes::Reducer::reduce(ReduceContext& context)

• Performance: One better than the other?




                                                           23
V.S. Hadoop Pipes
• Very similar in implementation
 – PipeMapper/PipeReducer ↔ PipesMapper/PipesReducer
 – InputWriter/OutputReader ↔ Application
 – Any executable ↔ a Pipes client must link against the C++ library
References
• (1) Hadoop: The Definitive Guide
• (2) Hadoop Streaming - http://hadoop.apache.org/common/docs/r0.20.2/streaming.html
• (3) How to Debug Map/Reduce Programs - http://wiki.apache.org/hadoop/HowToDebugMapReducePrograms
• (4) Hadoop Wiki - http://wiki.apache.org/hadoop/
The End
Thank You Very Much!
chiangbing@gmail.com

 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

Hadoop MapReduce Streaming and Pipes

  • 1. Hadoop Streaming and Pipes. July 10, 2012. Clay Jiang, Big Data Engineering Team, Hanborq Inc.
  • 2. Hadoop Streaming • Hadoop Streaming is a tool that runs an MR job using any executable program or script as the Map/Reduce step • $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar
  • 3. First Streaming Run • Basic command: – hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar -input /path/to/inputdir -output /path/to/outputdir -mapper /path/to/map_exec -reducer /path/to/reduce_exec
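  • 4. How Streaming Works? • The Mapper/Reducer launches map_exec/reduce_exec as a separate process • The Mapper/Reducer exchanges <key,value> pairs with that process over stdin and stdout • <key,value> pairs are passed to map_exec/reduce_exec in an agreed format; the default format is "key\tvalue"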
  • 6. Hadoop Streaming Example • Streaming WordCount
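The WordCount slide names the example without showing code in this transcript. A minimal sketch of what such a mapper and reducer might look like in Python follows. The function names, the single-script layout, and the "map"/"reduce" command-line argument are assumptions made for illustration, not the author's code.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one "word\t1" record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts of consecutive records that share a key.

    Assumes every line is "word\tcount" and that the input is sorted by key,
    which is what the reducer receives after the shuffle.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical convention: pass "map" or "reduce" as the only argument.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Both functions only read lines and yield lines, so they can be checked locally by piping the sorted mapper output into the reducer before submitting a job.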
  • 7. Streaming Internal • Just a tool, not a new mechanism • Adds an adapter layer on top of the existing MapReduce framework: – PipeMapper + PipeMapRunner – PipeCombiner – PipeReducer – No PipePartitioner
  • 10. Streaming-StreamJob • StreamJob – parseArgv: • Argv → Field Member – setJobConf: • Field Member → JobConf – submitAndMonitorJob: • JobConf submit to JobClient
  • 11. Streaming Map • -mapper <cmd|JavaClassName> • PipeMapRunner/PipeMapper – startOutputThreads: starts an MROutputThread that "tails" the stdout of map_exec, reads the output with an OutputReader, parses it, and writes it to the collector – PipeMapper.map: uses an InputWriter to serialize each key/value into a string that map_exec can parse, and writes it to the stdin of map_exec
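  • 12. Streaming Reduce • -reducer <cmd|JavaClassName> • PipeReducer – Relies on the internal MapReduce shuffle to deliver data to the reducer – startOutputThread: on the first reduce call, similarly starts an MROutputThread to collect the stdout of the "reducer cmd" – Similarly, uses an inputWriter to translate the reduce key/values and feeds them to the "reducer cmd" pair by pair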
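The map and reduce slides describe the same pattern: one thread drains the child's stdout while the task writes records to the child's stdin. A rough Python sketch of that pattern follows. It is an illustration of the idea, not the Java implementation; the `CHILD` command (a stand-in for map_exec that upper-cases its input) and the function names are assumptions.

```python
import subprocess
import sys
import threading

# Stand-in for map_exec: a child process that upper-cases each input line.
CHILD = [sys.executable, "-c",
         "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())"]


def run_through_child(records, cmd=CHILD):
    """Feed (key, value) records to a child over stdin and collect its stdout."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)
    collected = []

    def drain_stdout():
        # Plays the role of MROutputThread: read each output line and
        # parse it back into a (key, value) pair.
        for line in proc.stdout:
            key, _, value = line.rstrip("\n").partition("\t")
            collected.append((key, value))

    reader = threading.Thread(target=drain_stdout)
    reader.start()

    # Plays the role of PipeMapper.map: write records as "key\tvalue" lines.
    for key, value in records:
        proc.stdin.write(f"{key}\t{value}\n")
    proc.stdin.close()  # End of input lets the child finish.

    reader.join()
    proc.wait()
    return collected
```

Reading stdout on a separate thread matters because a child that produces a lot of output would otherwise fill its pipe and block while the parent is still writing input.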
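  • 13. InputWriter/OutputReader • InputWriter – Writes <key,value> to the stdin of the executable in a predefined encoding • OutputReader – Reads the stdout of the executable and decodes it into <key,value> • InputWriter + OutputReader – Together form the data transfer protocol between the Java process and the map/reduce executable process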
  • 14. TextInputWriter/TextOutputReader • Used by default: – TextInputWriter, TextOutputReader • <key,value> ↔ key + separator + value • Default separator: \t
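The default text encoding can be sketched in a few lines of Python. The function names are hypothetical. The handling of a line with no separator (whole line becomes the key, value is empty) follows the behaviour described in the Hadoop Streaming documentation, with the empty value represented here as an empty string.

```python
def write_record(key, value, separator="\t"):
    """Encode a pair as a text line: key + separator + value."""
    return f"{key}{separator}{value}\n"


def read_record(line, separator="\t"):
    """Decode one line, splitting at the first separator only.

    A line without the separator is returned as (line, "").
    """
    key, _, value = line.rstrip("\n").partition(separator)
    return key, value
```

Splitting at the first separator only means that a value may itself contain tabs, while a key may not.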
  • 16. Streaming Combiner • -combiner <cmd|JavaClassName> • PipeCombiner simply extends PipeReducer and follows the same flow as PipeReducer
  • 17. Streaming Partitioner • -partitioner <javaClassName> • For now, the partitioner must be a Java class
  • 18. Streaming I/O Format • -inputformat <javaClassName> – JobConf.setInputFormat() • -outputformat <javaClassName> – JobConf.setOutputFormat() • -inputreader <javaClassName>: – Uses StreamInputFormat as the InputFormat
  • 19. Streaming IO Spec • TextInputWriter/TextOutputReader: – stream.map/reduce.output.field.separator • The separator used in the output of the map/reduce executable – stream.map/reduce.input.field.separator • The separator used in the input to the map/reduce executable – stream.num.map/reduce.output.key.fields • The separator splits each line into fields; this setting specifies how many leading fields form the key
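How the separator and the key-field count interact can be shown with a small Python sketch. The function name and parameters are hypothetical, and the rule for lines with too few separators (whole line is the key, value is empty) is taken from the Hadoop Streaming documentation rather than from the slide.

```python
def split_key_value(line, separator="\t", num_key_fields=1):
    """Split a line into (key, value) after the num_key_fields-th separator."""
    fields = line.rstrip("\n").split(separator)
    if len(fields) <= num_key_fields:
        # Too few separators: the whole line is the key, the value is empty.
        return separator.join(fields), ""
    key = separator.join(fields[:num_key_fields])
    value = separator.join(fields[num_key_fields:])
    return key, value
```

For instance, with separator "." and four key fields, the line "a.b.c.d.e.f" splits into key "a.b.c.d" and value "e.f".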
  • 20. Streaming IO Spec • -io text|rawbytes|typedbytes – text → TextInputWriter/TextOutputReader – rawbytes → RawBytesInputWriter/RawBytesOutputReader – typedbytes → TypedBytesInputWriter/TypedBytesOutputReader – The option is resolved by IdentifierResolver
  • 21. User-Defined IO Spec • MyInputWriter/MyOutputReader – extend InputWriter/OutputReader • MyIdentifierResolver – extends IdentifierResolver – Used to resolve my → MyInputWriter/MyOutputReader – -D stream.io.identifier.resolver.class=MyIdentifierResolver
  • 22. Debug Streaming • -mapdebug/-reducedebug – When a map/reduce task fails, the debug script is run – $script $stdout $stderr $syslog $jobconf • -debug – On completion, does not delete /tmp/${user.name}/streamjob.jar
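A debug script receives four file paths in the order shown on the slide. The following Python sketch shows one thing such a script might do, namely print the tail of each log. The function names and the 20-line limit are assumptions, and nothing here is specific to a particular Hadoop version.

```python
import sys


def tail(path, max_lines=20):
    """Return the last max_lines lines of a text file."""
    with open(path, errors="replace") as handle:
        return handle.readlines()[-max_lines:]


def report(stdout_path, stderr_path, syslog_path, jobconf_path):
    """Build a short report from the four files passed to the debug script."""
    lines = [f"jobconf: {jobconf_path}\n"]
    for label, path in (("stdout", stdout_path), ("stderr", stderr_path),
                        ("syslog", syslog_path)):
        lines.append(f"--- last lines of {label} ({path}) ---\n")
        lines.extend(tail(path))
    return "".join(lines)


if __name__ == "__main__" and len(sys.argv) == 5:
    # Argument order follows the slide: $stdout $stderr $syslog $jobconf.
    sys.stdout.write(report(*sys.argv[1:5]))
```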
  • 23. V.S. Hadoop Pipes • Stdin/stdout ↔ Socket • Fixed I/O interface: $HADOOP_HOME/c++/$PLATFORM/include – HadoopPipes::Mapper::map(MapContext& context) – HadoopPipes::Reducer::reduce(ReduceContext& context) • Performance: One better than the other?
  • 24. V.S. Hadoop Pipes • Very similar in implementation – PipeMapper/PipeReducer ↔ PipesMapper/PipesReducer – InputWriter/OutputReader ↔ Application – Any executable ↔ Pipes clients must link against the C++ library
  • 25. References • (1) "Hadoop: The Definitive Guide" • (2) Hadoop Streaming - http://hadoop.apache.org/common/docs/r0.20.2/streaming.html • (3) How to Debug Map/Reduce Programs - http://wiki.apache.org/hadoop/HowToDebugMapReducePrograms • (4) Hadoop Wiki - http://wiki.apache.org/hadoop/
  • 26. The End. Thank You Very Much! chiangbing@gmail.com