Parallel Distributed Processing Technology in the Cloud Era

2010/12/22, Ruby Business Seminar. Koichi Fujikawa (藤川幸一), Hapyrus Inc. (ハピルス株式会社). Twitter: @fujibee

Agenda

About the Speaker

The Cloud Era and Large-Scale Data

What Is the Cloud Era?

The Cloud and Large-Scale Data

How to Process Large-Scale Data: a parallel, distributed mechanism is required, for example in the file system and in the processing technology.

Scale-Up and Scale-Out. [Chart: performance vs. price] Beyond a certain point, performance no longer rises in proportion to price.

Scale-Up and Scale-Out. [Chart: performance vs. price (number of machines)] Price and performance are proportional (up to a point).

Parallel Distributed Processing Is Hard. Source: Think IT http://thinkit.co.jp/story/2010/06/11/1608

MapReduce: A Parallel Distributed Processing Framework
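As a rough illustration of the MapReduce model named on this slide, the following plain-Ruby sketch runs the three phases (map, shuffle, reduce) of a word count in a single process. It does not use Hadoop; the method names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen only for this example.

```ruby
# Map phase: emit a [word, 1] pair for every word in one input line.
def map_phase(line)
  line.split.map { |word| [word, 1] }
end

# Shuffle phase: group all emitted values by their key.
def shuffle(pairs)
  pairs.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |(key, value), groups|
    groups[key] << value
  end
end

# Reduce phase: collapse the list of values for one key into a single sum.
def reduce_phase(key, values)
  [key, values.sum]
end

# Drive the three phases in sequence (a real framework would run
# the map and reduce calls in parallel across many machines).
def word_count(lines)
  pairs = lines.flat_map { |line| map_phase(line) }
  shuffle(pairs).map { |key, values| reduce_phase(key, values) }.to_h
end

word_count(["hello world", "hello ruby"]).each do |word, count|
  puts "#{word}\t#{count}"
end
```

Because each `map_phase` call sees only one line and each `reduce_phase` call sees only one key, both phases can be distributed across machines without the calls coordinating with each other, which is the property a framework such as Hadoop relies on.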

Hadoop

Distributed Data Store Technology

Parallel Distributed Processing Technology with Ruby

About the Mitoh (未踏) Project

IPA Mitoh (未踏) Human Resources Development Program

My Project

Hadoop Papyrus

Step 1: Write in Ruby instead of Java

Step 2: Simplify MapReduce with a Ruby DSL (diagram labels: Map, Reduce, Job Description, Log Analysis DSL)

Step 3: Make a Hadoop server configuration easy to use

Similar processing takes 70 lines in Java, but only 10 lines with Hadoop Papyrus!

Java:

    package org.apache.hadoop.examples;

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class WordCount {

      // Mapper: emits (word, 1) for every token in the input line.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer (also used as the combiner): sums the counts for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      // Job setup: wires the mapper, combiner, and reducer to the input and output paths.
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
          System.err.println("Usage: wordcount <in> <out>");
          System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Hadoop Papyrus:

    dsl 'LogAnalysis'

    from 'test/in'
    to 'test/out'

    pattern /([^|:]+)[^:]*/
    column_name :link

    topic "link num", :label => 'n' do
      count_uniq column[:link]
    end
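To make the intent of the Hadoop Papyrus DSL above more concrete, here is a minimal plain-Ruby sketch of one plausible reading of it: apply the `pattern` regex to each log line, treat the first capture group as the `link` column, and count how often each value occurs. It uses neither Hadoop nor the Hadoop Papyrus API. The `count_links` helper, the sample log lines, and the interpretation of `count_uniq` as a per-value count are all assumptions made for illustration.

```ruby
# Hypothetical helper: counts how often each captured link value appears.
# The default pattern is copied from the DSL example; its first capture
# group is treated here as the :link column (an assumption).
def count_links(lines, pattern = /([^|:]+)[^:]*/)
  lines.each_with_object(Hash.new(0)) do |line, counts|
    link = line[pattern, 1]        # first capture group, or nil if no match
    counts[link] += 1 if link
  end
end

# Hypothetical sample input; the real log format is not given in the slides.
sample_lines = ["Main_Page|en:12", "Ruby|ja:3", "Main_Page|en:7"]

count_links(sample_lines).each do |link, count|
  puts "#{link}\t#{count}"
end
```

In a real job the per-line extraction would correspond to the map side and the counting to the reduce side, which is the boilerplate the DSL is meant to hide.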

Actual Screens and More

Hadoop Has a High Barrier to Entry

Using Hadoop as a Service

Hapyrus

Sneak Preview

Summary

Thank you for your attention
