COOKPADでのHadoop利用 (Using Hadoop at COOKPAD), shared by Tatsuya Sasaki
21. master( )
slave( )
Records with the same key go to the same Reducer
Mapper Reducer
23. Hadoop
• Hadoop
• EC2 S3
ver. 0.18.3
• Cloudera & Hadoop Streaming
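24. S3 Native FileSystem
• Files are stored as ordinary S3 objects, so tools other than Hadoop can read them
• File size is limited to 5GB
• URI scheme is s3n:// (note the "n")
S3 Block FileSystem
• Files can only be read through Hadoop
• Files are stored as blocks, as in HDFS
• URI scheme is s3://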
29. cat hoge.csv | ruby mapper.rb | ruby reducer.rb
Reducer
Mapper Reducer
Mapper→Reducer: records with the same key go to the same Reducer
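As a rough illustration of the stdin/stdout pipeline above, here is a minimal sketch of a mapper and reducer pair written as plain Ruby functions. The job itself (counting log lines per type) is hypothetical and not taken from the slides; the "id,type,..." column layout is borrowed from the later slides. Hadoop sorts mapper output by key before it reaches a reducer, so a local test usually needs a sort step between the two scripts. The sketch makes that step explicit in run_locally.

```ruby
# Hypothetical mapper logic: turn one CSV log line ("id,type,...") into
# a tab-separated "type<TAB>1" record, or nil if the line has no type field.
def map_line(line)
  _id, type = line.chomp.split(",", 3)
  return nil if type.nil? || type.empty?
  "#{type}\t1"
end

# Hypothetical reducer logic: sum the values of consecutive records that
# share a key. This only works if the input is sorted by key, which is
# what Hadoop guarantees for each reducer.
def reduce_sorted(lines)
  results = []
  current_key = nil
  count = 0
  lines.each do |line|
    key, value = line.chomp.split("\t", 2)
    if key != current_key
      results << "#{current_key}\t#{count}" unless current_key.nil?
      current_key = key
      count = 0
    end
    count += value.to_i
  end
  results << "#{current_key}\t#{count}" unless current_key.nil?
  results
end

# Local stand-in for: mapper, then sort by key, then reducer.
def run_locally(csv_lines)
  mapped = csv_lines.map { |line| map_line(line) }.compact
  reduce_sorted(mapped.sort)
end
```

In real scripts, the mapper would call map_line on each line of ARGF and print the result, and the reducer would feed ARGF to reduce_sorted.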
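31. When the file is on the master
1) Specify it with the -file option
The file is copied from the master to the slaves (think scp)
hadoop jar xxx.jar
-mapper hoge.rb -reducer fuga.rb
-file hoge.rb -file fuga.rb
-file
2) Open it in the mapper or reducer
File.open(' ') {|f| ...}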
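32. When the file is on S3
1) Specify it with the -cacheFile option
The file is downloaded from S3 to the slaves
hadoop jar xxx.jar
-mapper hoge.rb -reducer fuga.rb
-file hoge.rb -file fuga.rb
-cacheFile s3n://path/to/ #
2) Open it in the mapper or reducer
File.open(' ') {|f| ...}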
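With either option the shipped file ends up in the task's working directory (for -cacheFile, under the link name given after the #), so the script can open it by a relative name. The following is a minimal sketch of step 2, assuming a side file with one ID per line. The helper name load_target_ids and the file name in the usage note are hypothetical; the slides do not show the actual file or its format.

```ruby
# Reads a side file from the task's working directory and returns its
# non-blank lines as an array of strings, with surrounding whitespace removed.
# Assumption: the file holds one ID per line.
def load_target_ids(path)
  File.open(path) do |f|
    f.each_line.map(&:strip).reject(&:empty?)
  end
end
```

A mapper could then call something like load_target_ids("target_ids.txt") once at startup, before it starts reading ARGF.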
34.
p target_ids.size # 50000
ARGF.each do |log|
  log.chomp!
  id, type, ... = log.split(/,/)
  next if target_ids.include?(id)
end
target_ids has 50,000 entries, and every include? call is a linear scan over the array, so this is slow.
35.
[13930, 29011, 39291, ...] # 50000 entries
Split them into about 1000 groups:
{
  '139' => [13930, 13989, 13991, ...], # about 50 entries
  '290' => [29011, 29098, 29076, ...], # about 50 entries
  '392' => [39291, 39244, 39251, ...], # about 50 entries
}
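36. Each lookup now scans only about 50 entries
# Group the IDs by their first three characters
hash = Hash.new {|h,k| h[k] = []}
target_ids.each do |id|
  hash[ id.to_s[0,3] ] << id
end
ARGF.each do |log|
  log.chomp!
  id, type, ... = log.split(/,/)
  # id is a String here, so the buckets must hold Strings too for include? to match
  next if hash[ id[0,3] ].include?(id)
end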
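The bucketing idea above can be sketched as a pair of small functions. This version makes two assumptions that the slides do not state: IDs are normalized to strings so they compare equal to fields returned by split, and lookups use fetch so that an unseen prefix does not create a new empty bucket. build_set is a swapped-in alternative rather than the slides' method: Ruby's standard Set gives hash-based membership tests with no manual bucketing. All function names here are hypothetical.

```ruby
require "set"

# Group IDs by their first three characters, as on the slide.
# IDs are stored as strings so they match fields read from the log.
def build_buckets(target_ids)
  buckets = Hash.new { |h, k| h[k] = [] }
  target_ids.each do |id|
    s = id.to_s
    buckets[s[0, 3]] << s
  end
  buckets
end

# Scan only the bucket that shares the ID's prefix.
# fetch bypasses the default block, so a miss adds no new keys to the hash.
def bucket_include?(buckets, id)
  buckets.fetch(id[0, 3], []).include?(id)
end

# Alternative (not from the slides): a Set of string IDs.
def build_set(target_ids)
  target_ids.map(&:to_s).to_set
end
```

In the log loop, the filter line would become next if bucket_include?(buckets, id), or next if id_set.include?(id) when using the Set.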
40. 8Amazon7 Elastic MapReduce 8 …
http://ow.ly/2bdW1
Amazon Elastic MapReduce
S3 Native FileSystem
java.net.SocketTimeoutException: Read timed out
Hadoop 0.21