Tech share
• Hadoop Core, our flagship sub-
  project, provides a distributed filesystem
  (HDFS) and support for the MapReduce
  distributed computing metaphor.
• Pig is a high-level data-flow language and
  execution framework for parallel computation.
  It is built on top of Hadoop Core.
ZooKeeper
• ZooKeeper is a highly available and reliable
  coordination system. Distributed applications
  use ZooKeeper to store and mediate updates
  for critical shared state.
JobTracker
• JobTracker: The JobTracker provides command
  and control for job management. It supplies
  the primary user interface to a MapReduce
  cluster. It also handles the distribution and
  management of tasks. There is one instance of
  this server running on a cluster. The machine
  running the JobTracker server is the
  MapReduce master.
TaskTracker
• TaskTracker: The TaskTracker provides
  execution services for the submitted jobs.
  Each TaskTracker manages the execution of
  tasks on an individual compute node in the
  MapReduce cluster. The JobTracker manages
  all of the TaskTracker processes. There is one
  instance of this server per compute node.
NameNode
• NameNode: The NameNode provides metadata
  storage for the shared file system. The
  NameNode supplies the primary user interface to
  the HDFS. It also manages all of the metadata for
  the HDFS. There is one instance of this server
  running on a cluster. The metadata includes such
  critical information as the file directory structure
  and which DataNodes have copies of the data
  blocks that contain each file’s data. The machine
  running the NameNode server process is the
  HDFS master.
Secondary NameNode
• Secondary NameNode: The secondary
  NameNode provides both file system metadata
  backup and metadata compaction. It supplies
  near real-time backup of the metadata for the
  NameNode. There is at least one instance of this
  server running on a cluster, ideally on a separate
  physical machine from the one running the
  NameNode. The secondary NameNode also
  merges the metadata change history, the edit
  log, into the NameNode’s file system image.
Design of HDFS
• Design of HDFS
  – Very large files
  – Streaming data access
  – Commodity hardware
• not a good fit
  – Low-latency data access
  – Lots of small files
  – Multiple writers, arbitrary file modifications
blocks
• a disk block is normally 512 bytes
• HDFS blocks: 64 MB by default
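A small sketch (assuming the 0.20-era Java API; not from the original slides) to check the block size the client will use:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ShowBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // 64 MB (67108864 bytes) unless dfs.block.size overrides it
    System.out.println("default block size: " + fs.getDefaultBlockSize() + " bytes");
  }
}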
HDFS File Read
• (diagram slide)
HDFS File Write
Usage
• Format the namenode before HDFS is set up for the first time
  – hadoop namenode -format
HDFS File Write
• OutputStream.write()
• OutputStream.flush() flushes the stream; the written data only becomes visible to readers once more than a block has been written
• OutputStream.sync() forces synchronization
• OutputStream.close() includes a sync()
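A minimal write sketch against this API (0.20-era; the address and path below are placeholders, and newer releases replace sync() with hflush()):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHDFS {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.126.133:9000"), conf);
    FSDataOutputStream out = fs.create(new Path("/t1/demo.txt"));
    out.write("content".getBytes("UTF-8"));
    out.flush();  // flushed, but not yet guaranteed visible to readers
    out.sync();   // force the written data to become visible
    out.close();  // close() also performs a sync()
  }
}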
DistCp Distributed Copy
• hadoop distcp -update hdfs://namenode1/foo
  hdfs://namenode2/bar

• hadoop distcp -update ……
  – copies only files that have changed
• hadoop distcp -overwrite ……
  – overwrites the target
• hadoop distcp -m 100 ……
  – the copy job is executed by N map tasks (here 100)
Hadoop File Archives
• HAR files

• hadoop archive -archiveName file.har /myfiles /outpath

• hadoop fs -ls /outpath/file.har
• hadoop fs -lsr har:///outpath/file.har
File Operations
• hadoop fs -rm hdfs://192.168.126.133:9000/xxx

• Other fs subcommands: cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, count, cp, du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, moveToLocal, mv, put, rm, rmr, setrep, stat, tail, test, text, touchz
Distributed Deployment
• Master & slave: 192.168.0.10
• Slave: 192.168.0.20

• Edit conf/masters
  – 192.168.0.10
• Edit conf/slaves
  – 192.168.0.10
  – 192.168.0.20
Installing Hadoop
• ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

• cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

• Disable the firewall: sudo ufw disable
Distributed Deployment: core-site.xml
             (identical on master & slave)
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/tony/tmp/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.0.10:9000</value>
  </property>
</configuration>
Distributed Deployment: hdfs-site.xml
               (master & slave)
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/tony/tmp/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/tony/tmp/data</value>
  </property>
</configuration>
• Make sure these directories exist on the current machine
Distributed Deployment: mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.0.10:9001</value>
  </property>
</configuration>
• Every machine is configured with the master's address
Run
• hadoop namenode -format
  – Before each format, run stop-all and empty all directories under tmp
• start-all.sh, or start-dfs.sh and start-mapred.sh
• Check the cluster status:
  – http://192.168.0.20:50070/dfshealth.jsp
  – or hadoop dfsadmin -report
could only be replicated
• java.io.IOException: could only be replicated
  to 0 nodes, instead of 1.

• Fix:
  – The XML configuration is wrong; make sure the addresses in the slaves' mapred-site.xml and core-site.xml match the master's
Incompatible namespaceIDs
• java.io.IOException: Incompatible
  namespaceIDs in /home/hadoop/data:
  namenode namespaceID = 1214734841;
  datanode namespaceID = 1600742075
• Cause:
  – tmp was not emptied before formatting, so the namespace IDs no longer match
• Fix:
  – edit the namenode's /home/hadoop/name/current/VERSION
UnknownHostException
• # hostname
• vi /etc/hostname to change the hostname
• vi /etc/hosts to add the IP mapping for the hostname
Name node is in safe mode
•   hadoop dfsadmin -safemode leave

•   Safe mode
    The NameNode enters safe mode at startup. If the proportion of blocks missing from the DataNodes exceeds (1 -
    dfs.safemode.threshold.pct), the system stays in safe mode, that is, read-only.
    dfs.safemode.threshold.pct (default 0.999f) means that at HDFS startup the NameNode may leave safe mode only once
    the DataNodes have reported 0.999 of the blocks recorded in the metadata; until then it remains read-only. If set
    to 1, HDFS stays in safe mode forever.
    The following line is taken from the NameNode startup log (the reported-block ratio of 1 reached the threshold 0.9990):
    The ratio of reported blocks 1.0000 has reached the threshold 0.9990. Safe mode will be turned off
    automatically in 18 seconds.
    There are two ways to leave safe mode:
    1. Lower dfs.safemode.threshold.pct (default 0.999) to a smaller value.
    2. Force it with hadoop dfsadmin -safemode leave.

•   Safe mode can be controlled with dfsadmin -safemode <value>, where value is one of:
    enter - enter safe mode
    leave - force the NameNode to leave safe mode
    get - report whether safe mode is on
    wait - block until safe mode ends
error in shuffle in fetcher
• org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher
• Fix:
  – The /etc/hosts configuration is at fault; on every node, add the hostnames and IP mappings of the other nodes to /etc/hosts
Auto sync
Adding a DataNode Dynamically
• In the master's conf/slaves, add the new node's address

• Start the daemons on the new node
  – bin/hadoop-daemon.sh start datanode
    bin/hadoop-daemon.sh start tasktracker

• Once started, Hadoop recognizes the new node automatically.
(screenshot slide)
Fault Tolerance
• If a node is unresponsive for too long, it is removed from the cluster and the other nodes restore its replicas.
Running MapReduce
• hadoop jar a.jar com.Map1
  hdfs://192.168.126.133:9000/hadoopconf/
  hdfs://192.168.126.133:9000/output2/

• Status:
• http://localhost:50030/
Read From Hadoop URL
// execute: hadoop ReadFromHDFS
public class ReadFromHDFS {
  static {
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }
  public static void main(String[] args) {
    try {
      URL uri = new URL("hdfs://192.168.126.133:9000/t1/a1.txt");
      IOUtils.copyBytes(uri.openStream(), System.out, 4096, false);
    } catch (FileNotFoundException e) {
      e.printStackTrace();
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
Read By FileSystem API
// execute: hadoop ReadByFileSystemAPI
public class ReadByFileSystemAPI {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://192.168.126.133:9000/t1/a2.txt";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
FileSystemAPI
Path path = new Path(URI.create("hdfs://192.168.126.133:9000/t1/tt/"));
if (fs.exists(path)) {
  fs.delete(path, true);
  System.out.println("deleted-----------");
} else {
  fs.mkdirs(path);
  System.out.println("created=====");
}

/**
 * List files
 */
FileStatus[] fileStatuses = fs.listStatus(new Path(URI.create("hdfs://192.168.126.133:9000/")));
for (FileStatus fileStatus : fileStatuses) {
  System.out.println("" + fileStatus.getPath().toUri().toString() + " dir:" + fileStatus.isDirectory());
}

PathFilter pathFilter = new PathFilter() {
  @Override
  public boolean accept(Path path) {
    return true;
  }
};
File Write Strategy
•   After a file is created, it is visible in the filesystem namespace:
•   1. Path p = new Path("p");
•   2. fs.create(p);
•   3. assertThat(fs.exists(p), is(true));
•   However, the content written to the file is not guaranteed to be visible, even after the stream has been flushed, so the file length still shows as 0:
•   1. Path p = new Path("p");
•   2. OutputStream out = fs.create(p);
•   3. out.write("content".getBytes("UTF-8"));
•   4. out.flush();
•   5. assertThat(fs.getFileStatus(p).getLen(), is(0L));
•   Once more than a block of data has been written, the first block becomes visible to new readers, and likewise for later blocks; it is always only the block currently being written that other readers cannot see.
•   out.sync() forces synchronization; close() calls sync() automatically
Cluster Copy and Archiving
• hadoop distcp -update hdfs://n1/foo
  hdfs://n2/bar/foo
• Archiving
  – hadoop archive -archiveName files.har /my/files
    /my
• Using an archive
  – hadoop fs -lsr har:///my/files.har
  – hadoop fs -lsr har://hdfs-localhost:8020/my/files.har/my/files/di
• Archive drawback: modifying, adding, or deleting files requires re-archiving
SequenceFile Reader&Writer
Configuration conf = new Configuration();
SequenceFile.Writer writer = null;
try {
  System.out.println("start....................");
  FileSystem fileSystem = FileSystem.newInstance(conf);
  IntWritable key = new IntWritable(1);
  Text value = new Text("");
  Path path = new Path("hdfs://192.168.126.133:9000/t1/seq");
  if (!fileSystem.exists(path)) {
    fileSystem.create(path);
    writer = SequenceFile.createWriter(fileSystem, conf, path, key.getClass(), value.getClass());
    for (int i = 1; i < 10; i++) {
      writer.append(new IntWritable(i), new Text("value" + i));
    }
    writer.close();
  } else {
    SequenceFile.Reader reader = new SequenceFile.Reader(fileSystem, path, conf);
    System.out.println("now while segment");
    while (reader.next(key, value)) {
      System.out.println("key:" + key.get() + " value:" + value + " position" + reader.getPosition());
    }
  }
} catch (IOException e) {
  e.printStackTrace();
} finally {
  IOUtils.closeStream(writer);
}
SequenceFile
•   1 value1
•   2 value2
•   3 value3
•   4 value4
•   5 value5
•   6 value6
•   7 value7
•   8 value8
•   9 value9
•   Each record consists of a Key and a Value
•   hadoop fs -text hdfs://……… can be used to display the file
SequenceMap
• Rebuilding the index:
  MapFile.fix(fileSystem, path, key.getClass(), value.getClass(), true, conf);

• MapFile.Writer writer = new
  MapFile.Writer(conf, fileSystem, path.toString(), key.getClass(), value.getClass());

• MapFile.Reader reader = new
  MapFile.Reader(fileSystem, path.toString(), conf);
Mapper Test Case
@Test
public void testMapper1() throws IOException {
  MyMapper myMapper = new MyMapper();
  Text text = new Text("xxxxxx<<HelloWorld>>xxxxxxxxxxxxxxxxxx");
  OutputCollector outputCollector = new OutputCollector<Text, IntWritable>() {
    public void collect(Text resultKey, IntWritable resultValue) throws IOException {
      System.out.println("resultKey:" + resultKey + " resultValue:" + resultValue);
      Assert.assertTrue("HelloWorld".equals(resultKey.toString()));
    }
  };
  myMapper.map(null, text, outputCollector, null);
}

public class MyMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  public void map(LongWritable longWritable, Text text, OutputCollector<Text, IntWritable> textIntWritableOutputCollector, Reporter reporter) throws IOException {
    Text result = new Text(text.toString().split("<<")[1].split(">>")[0]);
    textIntWritableOutputCollector.collect(result, new IntWritable(result.getLength()));
  }
}
Reducer Test Case
@Test
public void testReducer1() throws IOException {
  MyReducer myReducer = new MyReducer();
  ArrayList arrayList = new ArrayList();
  arrayList.add(new Text("a1")); arrayList.add(new Text("a222")); arrayList.add(new Text("a33"));
  Iterator it = arrayList.iterator();
  OutputCollector<Text, Text> outputCollector = new OutputCollector<Text, Text>() {
    public void collect(Text resultKey, Text resultValue) throws IOException {
      System.out.println("resultKey:" + resultKey + " resultValue:" + resultValue);
      Assert.assertTrue(resultKey.toString().equals("a222"));
    }
  };
  myReducer.reduce(null, it, outputCollector, null);
}

public class MyReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    Text t = new Text();
    while (values.hasNext()) {
      Text tmp = values.next();
      if (tmp.getLength() > t.getLength()) {
        t = tmp;
      }
    }
    output.collect(key, t);
  }
}
How MapReduce Executes a Job
• JobClient.submitJob()
• 1. Requests a new job ID from the JobTracker
• 2. Checks whether the job's input exists and whether the output directory already exists
• 3. Computes the input splits for the job; if the input directory does not exist, the error is returned to the MapReduce program
• 4. Copies the resources needed to run the job to the JobTracker's directory
• 5. Tells the JobTracker to run the job
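A minimal old-API submission sketch that triggers the steps above (the paths come from the command line, MyMapper is the mapper from the test-case slide, and the output types are illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitDemo {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitDemo.class);
    conf.setJobName("submit-demo");
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(MyMapper.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    // runJob() performs the submitJob() steps listed above and waits for completion
    JobClient.runJob(conf);
  }
}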
Mapper Input: Using Multiple InputFormats
// match the InputFormat to the input path
•   MultipleInputs.addInputPath(conf, new Path(args[0]), KeyValueTextInputFormat.class, KVTempMapper.class);
•   MultipleInputs.addInputPath(conf, new Path("hdfs://192.168.126.133:9000/*.txt"), TextInputFormat.class, KVTempMapper.class);
MapReduce Output Types
• Energy saving
• Multiple outputs:
  – implement Partitioner

  – conf.setPartitionerClass(MyPartitioner.class);

  – (code in the slide notes; a sketch follows below)
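A minimal sketch of the Partitioner mentioned above (old mapred API; the hash-based routing is illustrative, not the code from the slide notes):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class MyPartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) {
    // no configuration needed for this sketch
  }
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // route records to reduce partitions by key hash
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

Register it with conf.setPartitionerClass(MyPartitioner.class) as shown above.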
Custom Output Directory
• public class MyOutputFormat extends MultipleTextOutputFormat {
    protected String generateFileNameForKeyValue(Object key, Object value, String name) {
      return "abc.txt";
    }
  }

• At run time:
  conf.setOutputFormat(MyOutputFormat.class);
• All output then goes to the file abc.txt under the output directory
Setting Multiple Output Formats
•   MultipleOutputs.addNamedOutput(conf, "outputA", TextOutputFormat.class, LongWritable.class, Text.class);
•   MultipleOutputs.addNamedOutput(conf, "outputB", MyOutputFormat.class, LongWritable.class, Text.class);

• Always close the MultipleOutputs instance when you are done
• Override configure() to obtain the JobConf

  – (code in the slide notes; see the sketch below)
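A minimal sketch of using MultipleOutputs inside an old-API reducer (the class is hypothetical; "outputA" matches the addNamedOutput call above, and the record written is illustrative):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class MultiOutReducer extends MapReduceBase implements Reducer<Text, Text, LongWritable, Text> {
  private MultipleOutputs mos;

  @Override
  public void configure(JobConf job) {
    mos = new MultipleOutputs(job);   // override configure() to get the JobConf
  }

  public void reduce(Text key, Iterator<Text> values, OutputCollector<LongWritable, Text> output,
                     Reporter reporter) throws IOException {
    while (values.hasNext()) {
      // write to the named output registered as "outputA"
      mos.getCollector("outputA", reporter).collect(new LongWritable(1), values.next());
    }
  }

  @Override
  public void close() throws IOException {
    mos.close();   // MultipleOutputs must be closed when finished
  }
}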
Counters

• In a mapper or reducer
  – reporter.incrCounter(CounterType.Success, 1);
  – reporter.incrCounter("myGroup", "name", 2);
• The counts are printed when the job completes

• Reading a Counter from the program:
•   RunningJob runningJob = JobClient.runJob(conf);
•   JobClient jobClient = new JobClient(conf);
•   Counters.Counter counter = runningJob.getCounters().findCounter("myGroup", "counterA");
•   if (counter != null) {
•     System.out.println(counter.getCounter());
•   }
Sorting & Join
• conf.setOutputFormat(MapFileOutputFormat.class);
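A hedged sketch of how the MapFile output can then be used for sorted lookups and joins (old mapred API; assumes the job used the default HashPartitioner with IntWritable/Text records, and the output directory is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class MapFileLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path outputDir = new Path("/output2");   // the MapFileOutputFormat job output
    // one MapFile per reducer; getEntry picks the right one via the partitioner
    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs, outputDir, conf);
    IntWritable key = new IntWritable(5);
    Text value = new Text();
    MapFileOutputFormat.getEntry(readers, new HashPartitioner<IntWritable, Text>(), key, value);
    System.out.println(key + "\t" + value);
  }
}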
Pig
• Avoids hand-writing, compiling, packaging, and running MapReduce programs
• Running:
  – local mode: pig -x local
• export PIG_CLASSPATH=hadoop/conf

• Comments
  – /* xxx    */
  – -- xxxxxxxxxxxxxx (two dashes)
Pig syntax
• raw = LOAD 'excite.log' (load a file)
    – USING PigStorage('\t') (field delimiter)
    – AS (user:int, time:int, query:int); (field names and types)
•   register XXX.jar (use a JAR)
•   dump raw
•   describe raw (print the schema)
•   explain raw
•   store raw into 'aaa.txt' (save)
Pig syntax
• filter
   – ccc = filter aaa by name is null and age>10
• Group
   – bbb = group aaa by myColumn
• Foreach&Generate
   – ddd = foreach bbb generate group, MAX(aaa.temp)
• Illustrate
   – ILLUSTRATE aaa (show the steps)
Pig Built-in Functions
• split XXX into a1 if temp is not null, a2 if temp is null
• Built-in functions:
  – AVG, CONCAT, COUNT, DIFF, MAX, SIZE, SUM, TOKENIZE
  – IsEmpty
  – PigStorage
Foreach
• data:
  – a, 1, hello
  – b, 2, hey
• execute:
  – foreach XXX generate $2, $1+10, $0
• result:
  – hello, 11, a
  – hey, 12, b
Custom Function: UDF Filter
•   filter XXX by isGood(year)
•   public class GoodPig extends FilterFunc {
       public Boolean exec(Tuple tuple);
    }
•   Usage:
•   define isGood pig.GoodPig
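A completed sketch of the GoodPig filter above (the "good year" rule is made up for illustration; it assumes the first tuple field is an int year, and the package pig matches the define statement):

package pig;

import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class GoodPig extends FilterFunc {
  @Override
  public Boolean exec(Tuple tuple) throws IOException {
    if (tuple == null || tuple.size() == 0 || tuple.get(0) == null) {
      return false;
    }
    int year = (Integer) tuple.get(0);   // illustrative: keep records from 1900 onwards
    return year >= 1900;
  }
}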
Custom Pig Function: Changing Types
• public class MyEvalFunc extends EvalFunc {
    public List<FuncSpec> getArgToFuncMapping()
  }
• Usage:
  – define myEvalFunc com.MyEvalFunc
  – foreach XXX generate myEvalFunc(aaa)
MyLoadFunc and Storage Handling
• store XXX into 'out.txt' using PigStorage('==')
  – Input: Hello==1==a

• Custom LoadFunc
  – a1 = load 'xxx.txt' using com.MyLoadFunc() as (year:int, temp:int)
  – (code in the slide notes)
Pig Join
• aaa:
  – 1,hi
  – 2,hello
  – 3,nihao
• bbb:
  – a,2
  – b,3
  – c,1
• xxx = join aaa by $0, bbb by $1
• Result:
  1 hi      c 1
  2 hello   a 2
  3 nihao   b 3
Hive Overview
•   A data warehouse
•   Translates SQL-like syntax into MapReduce programs
•   No indexes or transactions; latency on the order of minutes
•   Does not support SQL HAVING
•   Supported data types
    – primitive types: string, int, double, boolean, etc.
    – complex types: Array, Map, Struct
Hive Data Warehouse
•   % export HIVE_HOME=/home/my/hive
•   Run: bin/hive
•   hive> SHOW TABLES;
•   hive -f script.q
•   hive -e 'SELECT * FROM dummy'
Hive: Creating Tables and Loading Data
• Create a table
• CREATE TABLE records
  – (year STRING, temperature INT, quality INT)
  – ROW FORMAT DELIMITED
  – FIELDS TERMINATED BY '\t';
• Load from a file:
  – LOAD DATA LOCAL INPATH 'input/ncdc/micro-tab/sample.txt'
  – OVERWRITE INTO TABLE records
Error while making MR scratch directory
• In Hadoop's core-site.xml, change the value of fs.default.name to the host name from /etc/hosts
• Then restart Hadoop and Hive

• If it reports "name node is in safe mode":
  – hadoop dfsadmin -safemode leave

  – or create the relevant directories on HDFS and grant write permission:
  – % hadoop fs -mkdir /tmp
  – % hadoop fs -chmod a+w /tmp
  – % hadoop fs -mkdir /user/hive/warehouse
  – % hadoop fs -chmod a+w /user/hive/warehouse
Hive Startup Modes
• hive --service hiveserver
Metastore
• The metastore consists of two parts
  – the service
  – the backing data store
Complex Types
•   CREATE TABLE complex (
•     col1 ARRAY<INT>,
•     col2 MAP<STRING, INT>,
•     col3 STRUCT<a:STRING, b:INT, c:DOUBLE>
•   );

• Query:
• SELECT col1[0], col2['b'], col3.c FROM complex
Managed Tables and External Tables
• A managed table moves its data into Hive's warehouse directory
  – CREATE TABLE managed_table (dummy STRING);
  – LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;
• External tables:
  – CREATE EXTERNAL TABLE external_table (dummy STRING) LOCATION '/user/tom/external_table';
  – LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
  – Dropping an external table deletes only the metadata, not the data
Hive Partitions
• Data is stored in per-partition directories
  – /user/hive/warehouse/tab4/level=2/city=beijing/h2.txt (level=2/city=beijing is the partition)
• Create the table
  – CREATE TABLE logs (ts BIGINT, line STRING)
  – PARTITIONED BY (dt STRING, country STRING);
• Usage:
  – LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
  – INTO TABLE logs
  – PARTITION (dt='2001-01-01', country='GB');
Hive Buckets
• CREATE TABLE bucketed_users (id INT, name
  STRING)
• CLUSTERED BY (id) INTO 4 BUCKETS;

• Splits the data into 4 buckets, so it can be processed by multiple MapReduce tasks
Delimiters
•   CREATE TABLE ...
•   ROW FORMAT DELIMITED
•    FIELDS TERMINATED BY '\001'
•    COLLECTION ITEMS TERMINATED BY '\002'
•    MAP KEYS TERMINATED BY '\003'
•    LINES TERMINATED BY '\n'
•   STORED AS TEXTFILE;
Specifying a SerDe (serialization/deserialization)
• CREATE TABLE stations (usaf STRING, wban STRING, name STRING)
• ROW FORMAT SERDE
  'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
• WITH SERDEPROPERTIES (
• "input.regex" = "(\d{6}) (\d{5}) (.{29}) .*"
• );

•   hive> SELECT * FROM stations LIMIT 4;
•   010000 99999 BOGUS NORWAY
•   010003 99999 BOGUS NORWAY
•   010010 99999 JAN MAYEN
•   010013 99999 ROST
Table Commands
• create table xxx as select name,age from tab2
• ALTER TABLE source RENAME TO target;
• ALTER TABLE target ADD COLUMNS (col3
  STRING);
• create table XXX as select c1,c2 from Tab2
Custom Functions: UDF
• select myFun(age) from tab3;

• public class MyFun extends UDF {
  }

• After writing it, register it:
  – create temporary function myFun as 'com.MyFun'
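A minimal sketch of the MyFun UDF above (Hive resolves UDFs by reflection on evaluate(); the +1 logic is illustrative, and the package com is only there to match 'com.MyFun'):

package com;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;

public class MyFun extends UDF {
  public IntWritable evaluate(IntWritable age) {
    if (age == null) {
      return null;
    }
    return new IntWritable(age.get() + 1);   // made-up logic for illustration
  }
}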
Custom Aggregate Functions: UDAF
• extends UDAF
HBase
•   start-hbase.sh
•   hbase shell
•   create 'tab1','col'
•   list (show tables)
•   put 'tab1','row1', 'col:name', 'XiaoMing'
•   put 'tab1', 'row1', 'col:age', '10'
•   put 'tab2', 'row2', 'col:name', 'DaMing'
•   Delete a table
    – disable 'tab1'
    – drop 'tab1'
HBase API Get
@Test
public void testGet() throws IOException {
  Configuration conf = HBaseConfiguration.create();
  // conf.set("hbase.master.port", "localhost:PORT");
  // conf.set("hbase.zookeeper.quorum", "IP");
  HTable table = new HTable(conf, "tab1");
  Get get = new Get(Bytes.toBytes("r1"));
  get.addColumn(Bytes.toBytes("col"), Bytes.toBytes("name"));
  Result result = table.get(get);
  byte[] value = result.value();
  System.out.println("v:" + Bytes.toString(value));
  byte[] val = result.getValue(Bytes.toBytes("col"), Bytes.toBytes("name"));
  System.out.println("Value: " + Bytes.toString(val));
}
More Related Content

What's hot

Hadoop spark performance comparison
Hadoop spark performance comparisonHadoop spark performance comparison
Hadoop spark performance comparisonarunkumar sadhasivam
 
Open Source Backup Conference 2014: Workshop bareos introduction, by Philipp ...
Open Source Backup Conference 2014: Workshop bareos introduction, by Philipp ...Open Source Backup Conference 2014: Workshop bareos introduction, by Philipp ...
Open Source Backup Conference 2014: Workshop bareos introduction, by Philipp ...NETWAYS
 
Hadoop installation
Hadoop installationHadoop installation
Hadoop installationAnkit Desai
 
Apache Hadoop Shell Rewrite
Apache Hadoop Shell RewriteApache Hadoop Shell Rewrite
Apache Hadoop Shell RewriteAllen Wittenauer
 
Perl for System Automation - 01 Advanced File Processing
Perl for System Automation - 01 Advanced File ProcessingPerl for System Automation - 01 Advanced File Processing
Perl for System Automation - 01 Advanced File ProcessingDanairat Thanabodithammachari
 
02 Hadoop deployment and configuration
02 Hadoop deployment and configuration02 Hadoop deployment and configuration
02 Hadoop deployment and configurationSubhas Kumar Ghosh
 
DBD::Gofer 200809
DBD::Gofer 200809DBD::Gofer 200809
DBD::Gofer 200809Tim Bunce
 
Hadoop installation on windows
Hadoop installation on windows Hadoop installation on windows
Hadoop installation on windows habeebulla g
 
第2回 Hadoop 輪読会
第2回 Hadoop 輪読会第2回 Hadoop 輪読会
第2回 Hadoop 輪読会Toshihiro Suzuki
 
20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍Tae Young Lee
 
Centralized + Unified Logging
Centralized + Unified LoggingCentralized + Unified Logging
Centralized + Unified LoggingGabor Kozma
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelTakahiro Inoue
 
Commands documentaion
Commands documentaionCommands documentaion
Commands documentaionTejalNijai
 
Perl Memory Use - LPW2013
Perl Memory Use - LPW2013Perl Memory Use - LPW2013
Perl Memory Use - LPW2013Tim Bunce
 
Hadoop 2.0 cluster setup on ubuntu 14.04 (64 bit)
Hadoop 2.0 cluster setup on ubuntu 14.04 (64 bit)Hadoop 2.0 cluster setup on ubuntu 14.04 (64 bit)
Hadoop 2.0 cluster setup on ubuntu 14.04 (64 bit)Nag Arvind Gudiseva
 
Course 102: Lecture 3: Basic Concepts And Commands
Course 102: Lecture 3: Basic Concepts And Commands Course 102: Lecture 3: Basic Concepts And Commands
Course 102: Lecture 3: Basic Concepts And Commands Ahmed El-Arabawy
 

What's hot (20)

Hadoop spark performance comparison
Hadoop spark performance comparisonHadoop spark performance comparison
Hadoop spark performance comparison
 
Open Source Backup Conference 2014: Workshop bareos introduction, by Philipp ...
Open Source Backup Conference 2014: Workshop bareos introduction, by Philipp ...Open Source Backup Conference 2014: Workshop bareos introduction, by Philipp ...
Open Source Backup Conference 2014: Workshop bareos introduction, by Philipp ...
 
Hadoop installation
Hadoop installationHadoop installation
Hadoop installation
 
Apache Hadoop Shell Rewrite
Apache Hadoop Shell RewriteApache Hadoop Shell Rewrite
Apache Hadoop Shell Rewrite
 
Perl Programming - 03 Programming File
Perl Programming - 03 Programming FilePerl Programming - 03 Programming File
Perl Programming - 03 Programming File
 
Perl for System Automation - 01 Advanced File Processing
Perl for System Automation - 01 Advanced File ProcessingPerl for System Automation - 01 Advanced File Processing
Perl for System Automation - 01 Advanced File Processing
 
02 Hadoop deployment and configuration
02 Hadoop deployment and configuration02 Hadoop deployment and configuration
02 Hadoop deployment and configuration
 
Hadoop
HadoopHadoop
Hadoop
 
Beginning hive and_apache_pig
Beginning hive and_apache_pigBeginning hive and_apache_pig
Beginning hive and_apache_pig
 
Hdfs java api
Hdfs java apiHdfs java api
Hdfs java api
 
DBD::Gofer 200809
DBD::Gofer 200809DBD::Gofer 200809
DBD::Gofer 200809
 
Hadoop installation on windows
Hadoop installation on windows Hadoop installation on windows
Hadoop installation on windows
 
第2回 Hadoop 輪読会
第2回 Hadoop 輪読会第2回 Hadoop 輪読会
第2回 Hadoop 輪読会
 
20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍
 
Centralized + Unified Logging
Centralized + Unified LoggingCentralized + Unified Logging
Centralized + Unified Logging
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
 
Commands documentaion
Commands documentaionCommands documentaion
Commands documentaion
 
Perl Memory Use - LPW2013
Perl Memory Use - LPW2013Perl Memory Use - LPW2013
Perl Memory Use - LPW2013
 
Hadoop 2.0 cluster setup on ubuntu 14.04 (64 bit)
Hadoop 2.0 cluster setup on ubuntu 14.04 (64 bit)Hadoop 2.0 cluster setup on ubuntu 14.04 (64 bit)
Hadoop 2.0 cluster setup on ubuntu 14.04 (64 bit)
 
Course 102: Lecture 3: Basic Concepts And Commands
Course 102: Lecture 3: Basic Concepts And Commands Course 102: Lecture 3: Basic Concepts And Commands
Course 102: Lecture 3: Basic Concepts And Commands
 

Similar to Hadoop 20111215

Learning Puppet basic thing
Learning Puppet basic thing Learning Puppet basic thing
Learning Puppet basic thing DaeHyung Lee
 
Haiku OS Presentation
Haiku OS PresentationHaiku OS Presentation
Haiku OS Presentationlaawrence
 
Infrastructure-as-Code (IaC) Using Terraform (Advanced Edition)
Infrastructure-as-Code (IaC) Using Terraform (Advanced Edition)Infrastructure-as-Code (IaC) Using Terraform (Advanced Edition)
Infrastructure-as-Code (IaC) Using Terraform (Advanced Edition)Adin Ermie
 
Jak se ^bonami\.(cz|pl|sk)$ vešlo do kontejneru
Jak se ^bonami\.(cz|pl|sk)$ vešlo do kontejneruJak se ^bonami\.(cz|pl|sk)$ vešlo do kontejneru
Jak se ^bonami\.(cz|pl|sk)$ vešlo do kontejneruVašek Boch
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msJodok Batlogg
 
Source Code of Building Linux IPv6 DNS Server (Complete Sourcecode)
Source Code of Building Linux IPv6 DNS Server (Complete Sourcecode)Source Code of Building Linux IPv6 DNS Server (Complete Sourcecode)
Source Code of Building Linux IPv6 DNS Server (Complete Sourcecode)Hari
 
Hadoop Interacting with HDFS
Hadoop Interacting with HDFSHadoop Interacting with HDFS
Hadoop Interacting with HDFSApache Apex
 
WordPress Development Environments
WordPress Development Environments WordPress Development Environments
WordPress Development Environments Ohad Raz
 
Ashish pandey huawei osi_days2011_cgroups_understanding_better
Ashish pandey huawei osi_days2011_cgroups_understanding_betterAshish pandey huawei osi_days2011_cgroups_understanding_better
Ashish pandey huawei osi_days2011_cgroups_understanding_bettersuniltomar04
 
Railsconf2011 deployment tips_for_slideshare
Railsconf2011 deployment tips_for_slideshareRailsconf2011 deployment tips_for_slideshare
Railsconf2011 deployment tips_for_slidesharetomcopeland
 
nodejs_at_a_glance.ppt
nodejs_at_a_glance.pptnodejs_at_a_glance.ppt
nodejs_at_a_glance.pptWalaSidhom1
 
Nagios Conference 2014 - Rob Hassing - How To Maintain Over 20 Monitoring App...
Nagios Conference 2014 - Rob Hassing - How To Maintain Over 20 Monitoring App...Nagios Conference 2014 - Rob Hassing - How To Maintain Over 20 Monitoring App...
Nagios Conference 2014 - Rob Hassing - How To Maintain Over 20 Monitoring App...Nagios
 
CASPUR Staging System II
CASPUR Staging System IICASPUR Staging System II
CASPUR Staging System IIAndrea PETRUCCI
 
2017-03-11 02 Денис Нелюбин. Docker & Ansible - лучшие друзья DevOps
2017-03-11 02 Денис Нелюбин. Docker & Ansible - лучшие друзья DevOps2017-03-11 02 Денис Нелюбин. Docker & Ansible - лучшие друзья DevOps
2017-03-11 02 Денис Нелюбин. Docker & Ansible - лучшие друзья DevOpsОмские ИТ-субботники
 
All I Need to Know I Learned by Writing My Own Web Framework
All I Need to Know I Learned by Writing My Own Web FrameworkAll I Need to Know I Learned by Writing My Own Web Framework
All I Need to Know I Learned by Writing My Own Web FrameworkBen Scofield
 

Similar to Hadoop 20111215 (20)

Learning Puppet basic thing
Learning Puppet basic thing Learning Puppet basic thing
Learning Puppet basic thing
 
Puppet
PuppetPuppet
Puppet
 
Interacting with hdfs
Interacting with hdfsInteracting with hdfs
Interacting with hdfs
 
2 technical-dns-workshop-day1
2 technical-dns-workshop-day12 technical-dns-workshop-day1
2 technical-dns-workshop-day1
 
Haiku OS Presentation
Haiku OS PresentationHaiku OS Presentation
Haiku OS Presentation
 
Infrastructure-as-Code (IaC) Using Terraform (Advanced Edition)
Infrastructure-as-Code (IaC) Using Terraform (Advanced Edition)Infrastructure-as-Code (IaC) Using Terraform (Advanced Edition)
Infrastructure-as-Code (IaC) Using Terraform (Advanced Edition)
 
Jak se ^bonami\.(cz|pl|sk)$ vešlo do kontejneru
Jak se ^bonami\.(cz|pl|sk)$ vešlo do kontejneruJak se ^bonami\.(cz|pl|sk)$ vešlo do kontejneru
Jak se ^bonami\.(cz|pl|sk)$ vešlo do kontejneru
 
Belvedere
BelvedereBelvedere
Belvedere
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900ms
 
Source Code of Building Linux IPv6 DNS Server (Complete Sourcecode)
Source Code of Building Linux IPv6 DNS Server (Complete Sourcecode)Source Code of Building Linux IPv6 DNS Server (Complete Sourcecode)
Source Code of Building Linux IPv6 DNS Server (Complete Sourcecode)
 
HDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFSHDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFS
 
Hadoop Interacting with HDFS
Hadoop Interacting with HDFSHadoop Interacting with HDFS
Hadoop Interacting with HDFS
 
WordPress Development Environments
WordPress Development Environments WordPress Development Environments
WordPress Development Environments
 
Ashish pandey huawei osi_days2011_cgroups_understanding_better
Ashish pandey huawei osi_days2011_cgroups_understanding_betterAshish pandey huawei osi_days2011_cgroups_understanding_better
Ashish pandey huawei osi_days2011_cgroups_understanding_better
 
Railsconf2011 deployment tips_for_slideshare
Railsconf2011 deployment tips_for_slideshareRailsconf2011 deployment tips_for_slideshare
Railsconf2011 deployment tips_for_slideshare
 
nodejs_at_a_glance.ppt
nodejs_at_a_glance.pptnodejs_at_a_glance.ppt
nodejs_at_a_glance.ppt
 
Nagios Conference 2014 - Rob Hassing - How To Maintain Over 20 Monitoring App...
Nagios Conference 2014 - Rob Hassing - How To Maintain Over 20 Monitoring App...Nagios Conference 2014 - Rob Hassing - How To Maintain Over 20 Monitoring App...
Nagios Conference 2014 - Rob Hassing - How To Maintain Over 20 Monitoring App...
 
CASPUR Staging System II
CASPUR Staging System IICASPUR Staging System II
CASPUR Staging System II
 
2017-03-11 02 Денис Нелюбин. Docker & Ansible - лучшие друзья DevOps
2017-03-11 02 Денис Нелюбин. Docker & Ansible - лучшие друзья DevOps2017-03-11 02 Денис Нелюбин. Docker & Ansible - лучшие друзья DevOps
2017-03-11 02 Денис Нелюбин. Docker & Ansible - лучшие друзья DevOps
 
All I Need to Know I Learned by Writing My Own Web Framework
All I Need to Know I Learned by Writing My Own Web FrameworkAll I Need to Know I Learned by Writing My Own Web Framework
All I Need to Know I Learned by Writing My Own Web Framework
 

Recently uploaded

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 

Recently uploaded (20)

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 

Hadoop 20111215

  • 2. • Hadoop Core, our flagship sub- project, provides a distributed filesystem (HDFS) and support for the MapReduce distributed computing metaphor. • Pig is a high-level data-flow language and execution framework for parallel computation. It is built on top of Hadoop Core.
  • 3. ZooKeeper • ZooKeeper is a highly available and reliable coordination system. Distributed applications use ZooKeeper to store and mediate updates for critical shared state.
  • 4. JobTracker • JobTracker: The JobTracker provides command and control for job management. It supplies the primary user interface to a MapReduce cluster. It also handles the distribution and management of tasks. There is one instance of this server running on a cluster. The machine running the JobTracker server is the MapReduce master.
  • 5. TaskTracker • TaskTracker: The TaskTracker provides execution services for the submitted jobs. Each TaskTracker manages the execution of tasks on an individual compute node in the MapReduce cluster. The JobTracker manages all of the TaskTracker processes. There is one instance of this server per compute node.
  • 6. NameNode • NameNode: The NameNode provides metadata storage for the shared file system. The NameNode supplies the primary user interface to the HDFS. It also manages all of the metadata for the HDFS. There is one instance of this server running on a cluster. The metadata includes such critical information as the file directory structure and which DataNodes have copies of the data blocks that contain each file’s data. The machine running the NameNode server process is the HDFS master.
  • 7. Secondary NameNode • Secondary NameNode: The secondary NameNode provides both file system metadata backup and metadata compaction. It supplies near real-time backup of the metadata for the NameNode. There is at least one instance of this server running on a cluster, ideally on a separate physical machine than the one running the NameNode. The secondary NameNode also merges the metadata change history, the edit log, into the NameNode’s file system image.
  • 8. Design of HDFS • Design of HDFS – Very large files – Streaming data access – Commodity hardware • not a good fit – Low-latency data access – Lots of small files – Multiple writers, arbitrary file modifications
  • 9. blocks • normally 512 bytes • HDFS : 64 MB by default
  • 10. HDFS文件读取 内存 •
  • 13. HDFS文件写入 • Outputsream.write() • Outputstream.flush() 刷新,超过一个block 的时候,才会读到。 • Outputstream.sync() 强制同步 • Outputstream.close() 包括sync()
  • 14. DistCp分布式复制 • hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/bar • hadoop distcp –update …… – 只更新修改过的文件 • hadoop distcp –overwrite …… – 覆盖 • hadoop distcp –m 100 …… – 复制任务被分成N个MAP执行
  • 15. Hadoop 文件归档 • Har文件 • Hadoop archive –archiveName file.har /myfiles /outpath • Hadoop fs –ls /outpath/file.har • Hadoop fs –lsr har:///outpath/file.har
  • 16. 文件操作 • Hadoop fs –rm hdfs://192.168.126.133:9000/xxx •cat •cp •lsr •rmr •chgrp •du •mkdir •setrep •chmod •dus •moveFromLocal •stat •chown •expunge •moveToLocal •tail •copyFromLocal •get •mv •test •copyToLocal •getmerge •put •text •count •ls •rm •touchz
  • 17. 分布式部署 • Master&slave 192.168.0.10 • Slave 192.168.0.20 • 修改conf/master – 192.168.0.10 • 修改Conf/slave – 192.168.0.10 – 192.168.0.20
  • 18. 安装hadoop • ssh-keygen-tdsa –P '‘ –f ~/.ssh/id_dsa • Cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys • 关闭防火墙Sudo ufw disable
  • 19. 分布式部署Core-site.xml (master&slave相同) • <configuration> • <property> • <name>hadoop.tmp.dir</name> • <value>/home/tony/tmp/tmp</value> • <description>Abaseforothertemporarydirectories.</description> • </property> • <property> • <name>fs.default.name</name> • <value>hdfs://192.168.0.10:9000</value> • </property> • </configuration>
  • 20. 分布式部署Hdfs-site.xml (master&slave) • <configuration> • <property> • <name>dfs.replication</name> • <value>1</value> • </property> • <property> • <name>dfs.name.dir</name> • <value>/home/tony/tmp/name</value> • </property> • <property> • <name>dfs.data.dir</name> • <value>/home/tony/tmp/data</value> • </property> • </configuration> • 并且保证当前机器有该目录
  • 21. 分布式部署Mapred-site.xml • <configuration> • <property> • <name>mapred.job.tracker</name> • <value>192.168.0.10:9001</value> • </property> • </configuration> • 所有的机器都配成master的地址
  • 22. Run • Hadoop namenode –format – 每次fotmat前,先stop-all,并清空tmp一下的 所有目录 • Start-all.sh 或 (start-dfs和start-mapred) • 显示运行情况: – http://192.168.0.20:50070/dfshealth.jsp – 或 hadoop dfsadmin -report
  • 23.
  • 24.
  • 25. could only be replicated • java.io.IOException: could only be replicated to 0 nodes, instead of 1. • 解决: – XML的配置不正确,要保证slave的mapred- site.xml和core-site.xml的地址都跟master一致
  • 26. Incompatible namespaceIDs • java.io.IOException: Incompatible namespaceIDs in /home/hadoop/data: namenode namespaceID = 1214734841; datanode namespaceID = 1600742075 • 原因: – 格式化前没清空tmp,导致ID不一致 • 解决: – 修改 namenode 的 /home/hadoop/name/current/VERSION
  • 27. UnknownHostException • # hostname • Vi /etc/hostname 修改hostname • Vi /etc/hosts 增加hostname对应的IP
  • 28. Name node is in safe mode • hadoop dfsadmin -safemode leave • safemode模式 NameNode在启动的时候首先进入安全模式,如果datanode丢失的block达到一定的比例(1- dfs.safemode.threshold.pct),则系统会一直处于安全模式状态即只读状态。 dfs.safemode.threshold.pct(缺省值0.999f)表示HDFS启动的时候,如果DataNode上报的block 个数达到了元数据记录的block个数的0.999倍才可以离开安全模式,否则一直是这种只读模 式。如果设为1则HDFS永远是处于SafeMode。 下面这行摘录自NameNode启动时的日志(block上报比例1达到了阀值0.9990) The ratio of reported blocks 1.0000 has reached the threshold 0.9990. Safe mode will be turned off automatically in 18 seconds. hadoop dfsadmin -safemode leave 有两个方法离开这种安全模式 1. 修改dfs.safemode.threshold.pct为一个比较小的值,缺省是0.999。 2. hadoop dfsadmin -safemode leave命令强制离开 • 用户可以通过dfsadmin -safemode value 来操作安全模式,参数value的说明如下: enter - 进入安全模式 leave - 强制NameNode离开安全模式 get - 返回安全模式是否开启的信息 wait - 等待,一直到安全模式结束。
  • 29. error in shuffle in fetcher • org.apache.hadoop.mapreduce.task.reduce.Sh uffle$ShuffleError: error in shuffle in fetcher • 解决方式: – 问题出在hosts文件的配置上,在所有节点的 /etc/hosts文件中加入其他节点的主机名和IP映 射
  • 30.
  • 32. 动态增加datanode • 主机的conf/slaves中,增加namenode的地址 • • 启动新增的namenode – bin/hadoop-daemon.sh start datanode bin/hadoop-daemon.sh start tasktracker • • 启动后,Hadoop自动识别。
  • 34. 容错 • 如果一个节点很长时间没反应,就会清出 集群,并且其它节点会把replication补上
  • 35.
  • 36. 执行 MapReduce • hadoop jar a.jar com.Map1 hdfs://192.168.126.133:9000/hadoopconf/ hdfs://192.168.126.133:9000/output2/ • 状态: • http://localhost:50030/
  • 37. Read From Hadoop URL • //execute: hadoop ReadFromHDFS • public class ReadFromHDFS { • static { • URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()); • } • public static void main(String[] args){ • try { • URL uri = new URL("hdfs://192.168.126.133:9000/t1/a1.txt"); • IOUtils.copyBytes(uri.openStream(), System.out, 4096, false); • }catch (FileNotFoundException e) { • e.printStackTrace(); • } catch (IOException e) { • e.printStackTrace(); • } • } • }
  • 38. Read By FileSystem API • //execute : hadoop ReadByFileSystemAPI • public class ReadByFileSystemAPI { • public static void main(String[] args) throws Exception { • String uri = ("hdfs://192.168.126.133:9000/t1/a2.txt");; • Configuration conf = new Configuration(); • FileSystem fs = FileSystem.get(URI.create(uri), conf); • FSDataInputStream in = null; • try { • in = fs.open(new Path(uri)); • IOUtils.copyBytes(in, System.out, 4096, false); • } finally { • IOUtils.closeStream(in); • } • } • }
  • 39. FileSystemAPI • Path path = new Path(URI.create("hdfs://192.168.126.133:9000/t1/tt/")); • if(fs.exists(path)){ • fs.delete(path,true); • System.out.println("deleted-----------"); • }else{ • fs.mkdirs(path); • System.out.println("creted====="); • } • /** • * List files • */ • FileStatus[] fileStatuses = fs.listStatus(new Path(URI.create("hdfs://192.168.126.133:9000/"))); • for(FileStatus fileStatus : fileStatuses){ • System.out.println("" + fileStatus.getPath().toUri().toString() + " dir:" + fileStatus.isDirectory()); • } • PathFilter pathFilter = new PathFilter(){ • @Override • public boolean accept(Path path) { • return true; • } • };
  • 40. 文件写入策略 • 在创建一个文件之后,在文件系统的命名空间中是可见的,如下所示: • 1. Path p = new Path("p"); • 2. Fs.create(p); • 3. assertThat(fs.exists(p),is(true)); • 但是,写入文件的内容并不保证能被看见,即使数据流已经被刷新。所以文 件长度显 • 示为0: • 1. Path p = new Path("p"); • 2. OutputStream out = fs.create(p); • 3. out.write("content".getBytes("UTF-8")); • 4. out.flush(); • 5. assertThat(fs.getFileStatus(p).getLen(),is(0L)); • 一旦写入的数据超过一个块的数据,新的读取者就能看见第一个块。对于之 后的块也 • 是这样。总之,它始终是当前正在被写入的块,其他读取者是看不见它的。 • out.sync(); 强制同步, close()的时候会自动调用sync()
  • 41. 集群复制 归档 • hadoop distcp -update hdfs://n1/foo hdfs://n2/bar/foo • 归档 – hadoop archive -archiveName files.har /my/files /my • 使用归档 – hadoop fs -lsr har:///my/files.har – hadoop fs -lsr har://hdfs-localhost:8020/my/files.har/my/files/di • 归档缺点:修改文件、增加删除文件 都需重新归档
  • 42. SequenceFile Reader&Writer • Configuration conf = new Configuration(); • SequenceFile.Writer writer =null ; • try { • System.out.println("start...................."); • FileSystem fileSystem = FileSystem.newInstance(conf); • IntWritable key = new IntWritable(1); • Text value = new Text(""); • Path path = new Path("hdfs://192.168.126.133:9000/t1/seq"); • if(!fileSystem.exists(path)){ • fileSystem.create(path); • writer = SequenceFile.createWriter(fileSystem, conf, path, key.getClass(), value.getClass()); • for(int i=1; i<10; i++){ • writer.append(new IntWritable(i), new Text("value" + i)); • } • writer.close(); • }else{ • SequenceFile.Reader reader = new SequenceFile.Reader(fileSystem,path,conf); • System.out.println("now while segment"); • while(reader.next(key, value)){ • System.out.println("key:" + key.get() + " value:" + value + " position" + reader.getPosition()); • }; • } • } catch (IOException e) { • e.printStackTrace(); • } finally{ • IOUtils.closeStream(writer); • }
  • 43. SequenceFile • 1 value1 • 2 value2 • 3 value3 • 4 value4 • 5 value5 • 6 value6 • 7 value7 • 8 value8 • 9 value9 • 包括一个Key 和一个 Value • 可以用hadoop fs –text hdfs://……… 来显示文件
  • 44. SequenceMap • 重建索引: MapFile.fix(fileSystem, path, key.getClass(), value. getClass(), true, conf); • MapFile.Writer writer = new MapFile.Writer(conf, fileSystem, path.toString(), k ey.getClass(), value.getClass()); • MapFile.Reader reader = new MapFile.Reader(fileSystem,path.toString(),conf);
  • 45. Mapper Test Case • @Test • public void testMapper1() throws IOException { • MyMapper myMapper = new MyMapper(); • Text text = new Text("xxxxxx<<HelloWorld>>xxxxxxxxxxxxxxxxxx"); • OutputCollector outputCollector = new OutputCollector<Text,IntWritable>(){ • public void collect(Text resultKey, IntWritable resultValue) throws IOException { • System.out.println("resultKey:" + resultKey + " resultValue:" + resultValue); • Assert.assertTrue("HelloWorld" . equals(resultKey.toString())); • } • }; • myMapper.map(null,text, outputCollector, null); • } • public class MyMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { • @Override • public void map(LongWritable longWritable, Text text, OutputCollector<Text, IntWritable> textIntWritableOutputCollector, Reporter reporter) throws IOException { • Text result = new Text(text.toString().split("<<")[1].split(">>")[0]); • textIntWritableOutputCollector.collect(result, new IntWritable(result.getLength())); • } • }
  • 46. Mapper Test Case • @Test • public void testReducer1() throws IOException { • MyReducer myReducer = new MyReducer(); • ArrayList arrayList = new ArrayList(); • arrayList.add(new Text("a1")); arrayList.add(new Text("a222")); arrayList.add(new Text("a33")); • Iterator it = arrayList.iterator(); • OutputCollector<Text,Text> outputCollector = new OutputCollector<Text,Text>(){ • public void collect(Text resultKey, Text resultValue) throws IOException { • System.out.println("resultKey:" + resultKey + " resultValue:" + resultValue); • Assert.assertTrue(resultKey.toString().equals("a222")); • } • }; • myReducer.reduce(null,it,outputCollector,null); • } • public class MyReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> { • public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { • int sum = 0; • Text t = new Text(); • while (values.hasNext()) { • Text tmp = values.next(); • if (tmp.getLength() > t.getLength()) { • t = tmp; • } • } • output.collect(key, t); • } • }
• 47. How MapReduce job submission works
  • JobClient.submitJob():
  • 1. Asks the JobTracker for a new job ID
  • 2. Checks the job's input and output specification (the output directory must not already exist)
  • 3. Computes the input splits for the job; if the input directory does not exist, the error is returned to the MapReduce program
  • 4. Copies the resources needed to run the job to the JobTracker's filesystem directory
  • 5. Tells the JobTracker that the job is ready to run
  • A minimal driver that triggers this flow is sketched below.
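For context, a minimal old-API (org.apache.hadoop.mapred) driver that triggers the submission flow above; the class name MyJobDriver, the job name, and the use of command-line arguments for the paths are placeholders, not part of the original deck:

  import java.io.IOException;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class MyJobDriver {
      public static void main(String[] args) throws IOException {
          JobConf conf = new JobConf(MyJobDriver.class);
          conf.setJobName("tech-share-demo");

          conf.setMapperClass(MyMapper.class);      // the mapper from slide 45
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);

          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // must not exist yet

          // runJob() calls submitJob() internally and then waits for the job to finish
          JobClient.runJob(conf);
      }
  }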
• 48. Mapper input: using multiple InputFormats
  // match an InputFormat (and Mapper) to each input path
  MultipleInputs.addInputPath(conf, new Path(args[0]), KeyValueTextInputFormat.class, KVTempMapper.class);
  MultipleInputs.addInputPath(conf, new Path("hdfs://192.168.126.133:9000/*.txt"), TextInputFormat.class, KVTempMapper.class);
• 49. Kinds of MapReduce output
  • Multiple outputs:
    – implement Partitioner (see the sketch below)
    – conf.setPartitionerClass(MyPartitioner.class);
    – the code is in the speaker notes
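The speaker notes are not reproduced here, so this is only a sketch of what such a Partitioner might look like under the old mapred API; the class name MyPartitioner comes from the slide, while the Text/IntWritable types and the "starts with a" rule are assumptions:

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // Routes records to reduce partitions (and therefore to separate output files) by key
  public class MyPartitioner implements Partitioner<Text, IntWritable> {

      public void configure(JobConf job) {
          // no configuration needed for this sketch
      }

      public int getPartition(Text key, IntWritable value, int numPartitions) {
          // assumed rule: keys starting with "a" go to partition 0, everything else is hashed
          if (key.toString().startsWith("a")) {
              return 0;
          }
          return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
  }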
• 50. Custom output file names
  public class MyOutputFormat extends MultipleTextOutputFormat {
      protected String generateFileNameForKeyValue(Object key, Object value, String name) {
          return "abc.txt";
      }
  }
  • At job setup time: conf.setOutputFormat(MyOutputFormat.class);
  • All output is then written to the file abc.txt inside the job's output directory
• 51. Configuring multiple named output formats
  MultipleOutputs.addNamedOutput(conf, "outputA", TextOutputFormat.class, LongWritable.class, Text.class);
  MultipleOutputs.addNamedOutput(conf, "outputB", MyOutputFormat.class, LongWritable.class, Text.class);
  • Always close the MultipleOutputs instance when you are finished with it
  • Override configure() to get hold of the JobConf (the code is in the speaker notes; a sketch follows below)
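Since the speaker notes are not included, the following is a sketch (old mapred API) of how a reducer might obtain the JobConf via configure() and write to the named output "outputA" declared above; the class name and the key/value handling are assumptions:

  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.hadoop.mapred.lib.MultipleOutputs;

  public class NamedOutputReducer extends MapReduceBase
          implements Reducer<Text, LongWritable, LongWritable, Text> {

      private MultipleOutputs multipleOutputs;

      @Override
      public void configure(JobConf job) {
          // configure() hands us the JobConf, which MultipleOutputs needs
          multipleOutputs = new MultipleOutputs(job);
      }

      public void reduce(Text key, Iterator<LongWritable> values,
                         OutputCollector<LongWritable, Text> output, Reporter reporter)
              throws IOException {
          while (values.hasNext()) {
              LongWritable value = values.next();
              // write to the named output "outputA" instead of the default job output
              multipleOutputs.getCollector("outputA", reporter).collect(value, key);
          }
      }

      @Override
      public void close() throws IOException {
          multipleOutputs.close();   // flushes and closes all named outputs
      }
  }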
• 52. Counters
  • In a mapper or reducer:
    – reporter.incrCounter(CounterType.Success, 1);   (CounterType is a user-defined enum, see below)
    – reporter.incrCounter("myGroup", "name", 2);
  • The counter totals are printed when the job completes
  • Reading a counter from the driver program:
    RunningJob runningJob = JobClient.runJob(conf);
    Counters.Counter counter = runningJob.getCounters().findCounter("myGroup", "counterA");
    if (counter != null) {
        System.out.println(counter.getCounter());
    }
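CounterType above is not a Hadoop class; a minimal assumed definition of the enum would be:

  // hypothetical counter names; only Success is referenced on the slide
  public enum CounterType {
      Success,
      Failed
  }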
• 53. Sorting & joins
  conf.setOutputFormat(MapFileOutputFormat.class);
  • Writing the sorted reduce output as MapFiles makes it possible to look records up by key afterwards; see the sketch below
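A sketch of the lookup side, assuming the job above produced MapFile output under the directory passed as args[0], used Text keys and values, and used the default HashPartitioner; the class name MapFileLookup is a placeholder:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapFileOutputFormat;
  import org.apache.hadoop.mapred.lib.HashPartitioner;

  public class MapFileLookup {
      public static void main(String[] args) throws IOException {
          Configuration conf = new Configuration();
          Path outputPath = new Path(args[0]);   // the job's output directory

          FileSystem fs = outputPath.getFileSystem(conf);
          // one MapFile.Reader per reduce partition
          MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs, outputPath, conf);

          // the same partitioner the job used decides which reader holds the key
          HashPartitioner<Text, Text> partitioner = new HashPartitioner<Text, Text>();
          Text key = new Text(args[1]);
          Text value = new Text();

          Text found = (Text) MapFileOutputFormat.getEntry(readers, partitioner, key, value);
          System.out.println(found == null ? "not found" : key + " -> " + found);
      }
  }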
• 54. Pig
  • Avoids hand-writing, compiling, packaging and running MapReduce programs
  • Running:
    – local mode: pig -x local
  • export PIG_CLASSPATH=hadoop/conf    (points Pig at the Hadoop configuration)
  • Comments:
    – /* xxx */
    – -- xxxxxxxxxxxxxx    (two dashes, single-line comment)
• 55. Pig syntax
  • raw = LOAD 'excite.log'    (load a file)
    – USING PigStorage('\t')    (field delimiter)
    – AS (user:int, time:int, query:int);    (field names and types)
  • register XXX.jar    (use classes from a JAR)
  • dump raw
  • describe raw    (print the schema)
  • explain raw
  • store raw into 'aaa.txt'    (save)
• 57. Pig syntax
  • filter
    – ccc = filter aaa by name is null and age > 10
  • group
    – bbb = group aaa by myColumn
  • foreach & generate
    – ddd = foreach bbb generate group, MAX(aaa.temp)
  • illustrate
    – ILLUSTRATE aaa    (shows step by step how sample data flows through the script)
• 58. Pig built-in functions
  • split XXX into a1 if temp is not null, a2 if temp is null
  • Built-in functions:
    – AVG, CONCAT, COUNT, DIFF, MAX, SIZE, SUM, TOKENIZE
    – IsEmpty
    – PigStorage
• 59. Foreach
  • data:
    – a, 1, hello
    – b, 2, hey
  • execute:
    – foreach XXX generate $2, $1+10, $0
  • result:
    – hello, 11, a
    – hey, 12, b
• 60. User-defined function (UDF): filter
  • filter XXX by isGood(year)
  • public class GoodPig extends FilterFunc {
        public Boolean exec(Tuple tuple);
    }
  • Registering the alias:
    – define isGood pig.GoodPig
  • A fleshed-out version of GoodPig is sketched below.
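A minimal working version of the GoodPig filter above, assuming the first field of the tuple is the year; the "year after 1900" condition is a placeholder, since the slide does not say what "good" means:

  import java.io.IOException;

  import org.apache.pig.FilterFunc;
  import org.apache.pig.data.Tuple;

  public class GoodPig extends FilterFunc {

      @Override
      public Boolean exec(Tuple tuple) throws IOException {
          if (tuple == null || tuple.size() == 0) {
              return false;
          }
          Object field = tuple.get(0);
          if (field == null) {
              return false;
          }
          // placeholder condition: keep records whose year is after 1900
          int year = Integer.parseInt(field.toString());
          return year > 1900;
      }
  }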
• 61. Custom Pig function: converting argument types (EvalFunc)
  public class MyEvalFunc extends EvalFunc {
      public List<FuncSpec> getArgToFuncMapping()
  }
  Usage:
  define myEvalFunc com.MyEvalFunc
  foreach XXX generate myEvalFunc(aaa)
  A sketch of such an EvalFunc is given below.
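A sketch of an EvalFunc that trims a chararray and uses getArgToFuncMapping() to tell Pig to coerce its argument to chararray; the trimming behaviour is an assumption, not the code from the speaker notes:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.pig.EvalFunc;
  import org.apache.pig.FuncSpec;
  import org.apache.pig.data.DataType;
  import org.apache.pig.data.Tuple;
  import org.apache.pig.impl.logicalLayer.FrontendException;
  import org.apache.pig.impl.logicalLayer.schema.Schema;

  public class MyEvalFunc extends EvalFunc<String> {

      @Override
      public String exec(Tuple input) throws IOException {
          if (input == null || input.size() == 0) {
              return null;
          }
          String value = (String) input.get(0);
          return value == null ? null : value.trim();
      }

      @Override
      public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
          // Declare that this UDF takes a single chararray argument,
          // so Pig inserts a cast when it is called with another type.
          List<FuncSpec> funcSpecs = new ArrayList<FuncSpec>();
          funcSpecs.add(new FuncSpec(this.getClass().getName(),
                  new Schema(new Schema.FieldSchema(null, DataType.CHARARRAY))));
          return funcSpecs;
      }
  }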
• 62. Custom LoadFunc and storage handling
  • store XXX into 'out.txt' using PigStorage('==')
    – line format: Hello==1==a
  • Custom LoadFunc:
    – a1 = load 'xxx.txt' using com.MyLoadFunc() as (year:int, temp:int)
    – the code is in the speaker notes
• 63. Joins in Pig
  • aaa:
    – 1,hi
    – 2,hello
    – 3,nihao
  • bbb:
    – a,2
    – b,3
    – c,1
  • xxx = join aaa by $0, bbb by $1
  • Result:
    – (1, hi, c, 1)
    – (2, hello, a, 2)
    – (3, nihao, b, 3)
• 64. Hive overview
  • A data warehouse system
  • Translates SQL-like queries into MapReduce jobs
  • No indexes or transactions; query latency is on the order of minutes
  • Does not support SQL's HAVING clause
  • Supported data types:
    – primitives: string, int, double, boolean, etc.
    – complex types: Array, Map, Struct
• 65. Running Hive
  • % export HIVE_HOME=/home/my/hive
  • Run the shell: bin/hive
  • hive> SHOW TABLES;
  • hive -f script.q    (run a script file)
  • hive -e 'SELECT * FROM dummy'    (run a one-off query)
• 66. Creating a Hive table and loading data
  • Create the table:
    CREATE TABLE records (year STRING, temperature INT, quality INT)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t';
  • Load from a file:
    LOAD DATA LOCAL INPATH 'input/ncdc/micro-tab/sample.txt'
    OVERWRITE INTO TABLE records;
• 67. "Error while making MR scratch directory"
  • In Hadoop's core-site.xml, change the value of fs.default.name to the hostname defined in the hosts file
  • Then restart Hadoop and Hive
  • If the error says the name node is in safe mode:
    – hadoop dfsadmin -safemode leave
    – or create the required directories on HDFS and open up their permissions:
      % hadoop fs -mkdir /tmp
      % hadoop fs -chmod a+w /tmp
      % hadoop fs -mkdir /user/hive/warehouse
      % hadoop fs -chmod a+w /user/hive/warehouse
• 68. Running Hive as a server
  • hive --service hiveserver
• 70. Complex types
  CREATE TABLE complex (
      col1 ARRAY<INT>,
      col2 MAP<STRING, INT>,
      col3 STRUCT<a:STRING, b:INT, c:DOUBLE>
  );
  • Query:
    SELECT col1[0], col2['b'], col3.c FROM complex;
• 71. Managed tables and external tables
  • Managed tables move the data into Hive's warehouse directory:
    – CREATE TABLE managed_table (dummy STRING);
    – LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;
  • External tables:
    – CREATE EXTERNAL TABLE external_table (dummy STRING) LOCATION '/user/tom/external_table';
    – LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
    – Dropping an external table deletes only the metadata, not the data
• 72. Hive partitions
  • Data is stored in per-partition directories:
    – /user/hive/warehouse/tab4/level=2/city=beijing/h2.txt    (the level=2/city=beijing part is the partition)
  • Creating a partitioned table:
    – CREATE TABLE logs (ts BIGINT, line STRING)
    – PARTITIONED BY (dt STRING, country STRING);
  • Loading into a partition:
    – LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
    – INTO TABLE logs
    – PARTITION (dt='2001-01-01', country='GB');
• 73. Hive buckets
  • CREATE TABLE bucketed_users (id INT, name STRING)
  • CLUSTERED BY (id) INTO 4 BUCKETS;
  • The data is split into 4 buckets, so the work can be divided across multiple MapReduce tasks
• 74. Delimiters
  CREATE TABLE ...
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
  LINES TERMINATED BY '\n'
  STORED AS TEXTFILE;
• 75. Specifying a SerDe (serializer/deserializer)
  CREATE TABLE stations (usaf STRING, wban STRING, name STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
  WITH SERDEPROPERTIES (
      "input.regex" = "(\d{6}) (\d{5}) (.{29}) .*"
  );
  hive> SELECT * FROM stations LIMIT 4;
  010000 99999 BOGUS NORWAY
  010003 99999 BOGUS NORWAY
  010010 99999 JAN MAYEN
  010013 99999 ROST
• 76. Table commands
  • CREATE TABLE xxx AS SELECT name, age FROM tab2;
  • ALTER TABLE source RENAME TO target;
  • ALTER TABLE target ADD COLUMNS (col3 STRING);
  • CREATE TABLE XXX AS SELECT c1, c2 FROM Tab2;
• 77. User-defined functions (UDFs)
  • select myFun(age) from tab3;
  • public class MyFun extends UDF {
    }
  • After writing the class, register it:
    – create temporary function myFun as 'com.MyFun'
  • A sketch of a complete UDF class follows below.
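The slide leaves the class body empty; a minimal sketch of what MyFun might look like is shown here, with a purely illustrative behaviour (doubling an integer) that is not taken from the deck:

  import org.apache.hadoop.hive.ql.exec.UDF;
  import org.apache.hadoop.io.IntWritable;

  public class MyFun extends UDF {

      // Hive resolves UDFs by method name: it looks for a method called evaluate()
      public IntWritable evaluate(IntWritable age) {
          if (age == null) {
              return null;
          }
          // illustrative behaviour only: double the input value
          return new IntWritable(age.get() * 2);
      }
  }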
• 94. HBase
  • start-hbase.sh
  • hbase shell
  • create 'tab1','col'
  • list    (show tables)
  • put 'tab1', 'row1', 'col:name', 'XiaoMing'
  • put 'tab1', 'row1', 'col:age', '10'
  • put 'tab2', 'row2', 'col:name', 'DaMing'
  • Deleting a table:
    – disable 'tab1'
    – drop 'tab1'
• 95. HBase API: Get
  @Test
  public void testGet() throws IOException {
      Configuration conf = HBaseConfiguration.create();
      // conf.set("hbase.master.port", "localhost:PORT");
      // conf.set("hbase.zookeeper.quorum", "IP");
      HTable table = new HTable(conf, "tab1");
      Get get = new Get(Bytes.toBytes("r1"));
      get.addColumn(Bytes.toBytes("col"), Bytes.toBytes("name"));
      Result result = table.get(get);
      byte[] value = result.value();
      System.out.println("v:" + Bytes.toString(value));
      byte[] val = result.getValue(Bytes.toBytes("col"), Bytes.toBytes("name"));
      System.out.println("Value: " + Bytes.toString(val));
  }
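To complement the Get example, here is a sketch of the corresponding write path using the same old HTable client API; the table, family and qualifier names mirror the shell example on slide 94, but this code is an assumption and not part of the original deck:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PutExample {
      public static void main(String[] args) throws IOException {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "tab1");

          Put put = new Put(Bytes.toBytes("row1"));
          // column family "col", qualifiers "name" and "age"
          put.add(Bytes.toBytes("col"), Bytes.toBytes("name"), Bytes.toBytes("XiaoMing"));
          put.add(Bytes.toBytes("col"), Bytes.toBytes("age"), Bytes.toBytes("10"));
          table.put(put);

          table.close();
      }
  }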