
Large-Scale Graph Processing 〜Introduction〜 (Complete Edition)

For large-scale graph processing, computation models more efficient than MapReduce have been proposed, and implementations are under way in projects such as Google Pregel, Giraph, Hama, and GoldenOrb. Hama and Giraph are also being ported to NextGen Apache Hadoop MapReduce (YARN). This lightning talk introduces what "Large-Scale Graph Processing" is by comparing it with MapReduce, and closes with the distinguishing features of each project.


  1. http://www.catehuston.com/blog/2009/11/02/touchgraph/
  2. Hadoop MapReduce デザインパターン ―MapReduceによる大規模テキストデータ処理― (Jimmy Lin and Chris Dyer; supervised by Takashi Kambayashi and Naoyuki Nomura; translated by Ryuji Tamagawa). Scheduled for release October 1, 2011; 210 pages; ¥2,940.
  3. [Figure: iterative MapReduce; each iteration i → i+1 pays a full shuffle & barrier plus job start/shutdown (simulated in the in-memory sketch after the slide list)]
  4.–5. [Figures: single-source shortest paths worked by hand on a small weighted graph with vertices A–G; between steps i and i+1 a tentative distance is relaxed via min(6, 4) = 4]
  6. A superstep in the Bulk Synchronous Parallel model: http://en.wikipedia.org/wiki/Bulk_Synchronous_Parallel
  7. ...
  8.–9. [Figures: a superstep, continued (see the toy BSP runner after the slide list)]
  10.–21. [Figures: a step-by-step shortest-path walkthrough on the A–G example graph; every vertex starts at +∞ except the source, and each round relaxes distances one more hop (rounds 1–3, then "end" once nothing changes); reconstructed in plain Python after the slide list]
  22. class ShortestPathMapper(Mapper):
          def map(self, node_id, Node):
              # pass the graph structure through to the reducer
              emit node_id, Node
              # add the node's current distance to each outgoing edge weight
              dist = Node.get_value()
              for neighbour_node_id in Node.get_adjacency_list():
                  dist_to_nbr = Node.get_distance(node_id, neighbour_node_id)
                  emit neighbour_node_id, dist + dist_to_nbr
  23. class ShortestPathReducer(Reducer):
          def reduce(self, node_id, dist_list):
              min_dist = sys.maxint
              for dist in dist_list:
                  # dist_list mixes candidate distances with the Node itself
                  if is_node(dist):
                      Node = dist
                  elif dist < min_dist:
                      min_dist = dist
              Node.set_value(min_dist)
              emit node_id, Node
  24. # In-Mapper Combiner
      class ShortestPathMapper(Mapper):
          def __init__(self):
              self.buffer = {}

          def check_and_put(self, key, value):
              # keep only the smallest distance seen so far for each key
              if key not in self.buffer or value < self.buffer[key]:
                  self.buffer[key] = value

          def check_and_emit(self):
              # flush the buffer only once it grows past its size limit
              if is_exceed_limit_buffer_size(self.buffer):
                  for key, value in self.buffer.items():
                      emit key, value
                  self.buffer = {}

          def close(self):
              # flush whatever is left at the end of the map task
              for key, value in self.buffer.items():
                  emit key, value
  25. # (...continued)
          def map(self, node_id, Node):
              # pass the graph structure through to the reducer
              emit node_id, Node
              # add the node's current distance to each outgoing edge weight
              dist = Node.get_value()
              for nbr_node_id in Node.get_adjacency_list():
                  dist_to_nbr = Node.get_distance(node_id, nbr_node_id)
                  dist_nbr = dist + dist_to_nbr
                  self.check_and_put(nbr_node_id, dist_nbr)
                  self.check_and_emit()
  26. # Shimmy trick
      class ShortestPathReducer(Reducer):
          def __init__(self):
              # open the graph partition that corresponds to this reducer
              P.open_graph_partition()

          def emit_precede_node(self, node_id):
              # stream through the partition: re-emit every node preceding
              # node_id unchanged, and return the matching Node itself
              for pre_node_id, Node in P.read():
                  if node_id == pre_node_id:
                      return Node
                  else:
                      emit pre_node_id, Node
  27. # (...continued)
          def reduce(self, node_id, dist_list):
              Node = self.emit_precede_node(node_id)
              min_dist = sys.maxint
              for dist in dist_list:
                  if dist < min_dist:
                      min_dist = dist
              Node.set_value(min_dist)
              emit node_id, Node
  28.–39. [Figures: the same walkthrough traced again round by round (rounds 1–5, then "end"), showing distance updates rippling outward from the source]
  40. class ShortestPathVertex:
          def compute(self, msgs):
              min_dist = 0 if self.is_source() else sys.maxint
              # take the minimum over all incoming messages
              for msg in msgs:
                  min_dist = min(min_dist, msg.get_value())
              if min_dist < self.get_value():
                  # update the current value (state)
                  self.set_current_value(min_dist)
                  # send the new value along every outgoing edge
                  out_edge_iterator = self.get_out_edge_iterator()
                  for out_edge in out_edge_iterator:
                      recipient = out_edge.get_other_element(self.get_id())
                      self.send_message(recipient.get_id(),
                                        min_dist + out_edge.get_distance())
              self.vote_to_halt()
  41. Pregel
  42. HAMA paper, first page (authors at KAIST, Korea Advanced Institute of Science and Technology, and Sungkyunkwan University, South Korea). Abstract: "Various scientific computations have become so complex, and thus computation tools play an important role. In this paper, we explore the state-of-the-art framework providing high-level matrix computation primitives with MapReduce through the case study approach, and demonstrate these primitives with different computation engines to show the performance and scalability. We believe the opportunity for using MapReduce in scientific computation is even more promising than the success to date in the parallel systems literature." From the introduction: HAMA is a distributed framework on Hadoop for massive matrix and graph computations, aiming at a powerful tool for various scientific applications by providing basic primitives for developers and researchers with simple APIs; it is currently being incubated as one of the subprojects of Hadoop by the Apache Software Foundation [10]. Fig. 1, "The overall architecture of HAMA": the HAMA Shell and HAMA API sit on the HAMA Core, whose computation engine is pluggable (MapReduce, BSP, Dryad); ZooKeeper handles distributed locking, and HBase, HDFS, an RDBMS, or plain files serve as storage. http://wiki.apache.org/hama/Articles
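
What the walkthrough on slides 10–21 animates is plain round-by-round relaxation. The sketch below reconstructs it in ordinary Python; the exact A–G edge weights are not fully recoverable from the transcript, so the weights here are made up for illustration.

    # Round-by-round single-source shortest paths (Bellman-Ford style).
    # NOTE: hypothetical edge weights; the deck's exact A-G graph differs.
    GRAPH = {
        'A': {'B': 1, 'C': 5},
        'B': {'D': 3, 'E': 4},
        'C': {'D': 2},
        'D': {'E': 1, 'F': 4},
        'E': {'G': 3},
        'F': {'G': 5},
        'G': {},
    }

    def shortest_paths(source):
        # every vertex starts at +infinity except the source (slide 10)
        dist = {v: float('inf') for v in GRAPH}
        dist[source] = 0
        round_no = 0
        while True:
            round_no += 1
            changed = False
            # one "round": relax every edge once, like one MapReduce pass
            # or one BSP superstep over the whole graph
            for u, edges in GRAPH.items():
                for v, w in edges.items():
                    if dist[u] + w < dist[v]:
                        dist[v] = dist[u] + w
                        changed = True
            print('round %d: %s' % (round_no, dist))
            if not changed:       # the "end" state on slide 21
                break
        return dist

    shortest_paths('A')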
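The mapper and reducer of slides 22–23 can be exercised without a cluster. Below is a minimal in-memory simulation under stated assumptions: emit becomes appending to a list, a Node is a plain dict, the shuffle is a defaultdict grouping, and the loop at the bottom stands in for the per-iteration job driver of slide 3 (one full map + shuffle + reduce per pass). Keeping the node's old value inside the min is a small safety addition over the slide's pseudocode.

    from collections import defaultdict

    # A Node is modeled as {'value': distance, 'adj': {nbr: edge weight}}.
    # Hypothetical toy graph; the source vertex is 'A'.
    def make_graph():
        g = {'A': {'B': 1, 'C': 5}, 'B': {'C': 2}, 'C': {}}
        return {v: {'value': 0 if v == 'A' else float('inf'), 'adj': adj}
                for v, adj in g.items()}

    def map_phase(graph):
        out = []                                # 'emit' == append
        for node_id, node in graph.items():
            out.append((node_id, node))         # pass graph structure through
            for nbr, w in node['adj'].items():  # distance via this node
                out.append((nbr, node['value'] + w))
        return out

    def reduce_phase(pairs):
        grouped = defaultdict(list)             # the shuffle & sort stage
        for k, v in pairs:
            grouped[k].append(v)
        graph = {}
        for node_id, values in grouped.items():
            node = next(v for v in values if isinstance(v, dict))
            dists = [v for v in values if not isinstance(v, dict)]
            node['value'] = min([node['value']] + dists)
            graph[node_id] = node
        return graph

    graph = make_graph()
    for _ in range(len(graph)):                 # one MapReduce job per pass
        graph = reduce_phase(map_phase(graph))
    print({v: n['value'] for v, n in graph.items()})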
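The in-mapper combiner of slides 24–25 trims intermediate traffic by keeping only the smallest distance per destination inside the mapper and flushing in batches. Here is a self-contained sketch of just that buffering pattern; the buffer-size limit is an assumed constant, since is_exceed_limit_buffer_size is left abstract on the slide.

    class MinBuffer(object):
        """Keep only the smallest value seen per key; flush when too large."""
        LIMIT = 4                  # assumed buffer-size limit for illustration

        def __init__(self):
            self.buffer = {}
            self.emitted = []      # stand-in for emit

        def check_and_put(self, key, value):
            if key not in self.buffer or value < self.buffer[key]:
                self.buffer[key] = value

        def check_and_emit(self):
            if len(self.buffer) >= self.LIMIT:
                self.emitted.extend(self.buffer.items())
                self.buffer = {}

        def close(self):
            # flush the remainder at the end of the map task
            self.emitted.extend(self.buffer.items())
            self.buffer = {}

    buf = MinBuffer()
    for key, value in [('B', 6), ('B', 4), ('C', 5)]:
        buf.check_and_put(key, value)
        buf.check_and_emit()
    buf.close()
    print(buf.emitted)             # [('B', 4), ('C', 5)]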
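The shimmy trick of slides 26–27 never shuffles the graph structure at all: each reducer streams its own graph partition, sorted the same way as the reduce keys, and merges the two streams key by key. A minimal in-memory sketch, with the partition modeled as a sorted iterator; a full version would also drain any trailing partition entries after the last reduce key.

    class Partition(object):
        """Stand-in for a reducer-side graph partition, sorted by node id."""
        def __init__(self, nodes):
            self.stream = iter(sorted(nodes.items()))

    def shimmy_reduce(partition, reduce_input, emit):
        # reduce_input: (node_id, dist_list) pairs sorted by node_id, in the
        # same order as the partition -- the precondition the trick relies on
        for node_id, dist_list in reduce_input:
            # advance the partition, re-emitting preceding nodes unchanged
            for pre_node_id, node in partition.stream:
                if pre_node_id == node_id:
                    break
                emit(pre_node_id, node)
            node['value'] = min(dist_list)
            emit(node_id, node)

    out = []
    nodes = {'A': {'value': 0}, 'B': {'value': float('inf')},
             'C': {'value': float('inf')}}
    # Only B and C received new distances this pass; A is passed through.
    shimmy_reduce(Partition(nodes),
                  [('B', [1]), ('C', [3, 5])],
                  lambda k, v: out.append((k, v)))
    print(out)                     # A unchanged, B -> 1, C -> 3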
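Slides 6–9 and the vertex program of slide 40 meet in a toy synchronous runner: messages sent in superstep i are only delivered in superstep i+1, which is exactly the BSP barrier. Everything below is a single-process sketch, not the Pregel or Hama API; messages are bare numbers rather than objects, and the edge weights are again invented.

    class ShortestPathVertex(object):
        def __init__(self, vid, out_edges, is_source):
            self.vid = vid
            self.out_edges = out_edges      # {neighbour id: edge weight}
            self.is_source = is_source
            self.value = float('inf')
            self.active = True
            self.outbox = []                # messages sent this superstep

        def compute(self, msgs):
            min_dist = 0 if self.is_source else float('inf')
            for m in msgs:                  # minimum over incoming messages
                min_dist = min(min_dist, m)
            if min_dist < self.value:       # improved: update and notify
                self.value = min_dist
                for nbr, w in self.out_edges.items():
                    self.outbox.append((nbr, min_dist + w))
            self.active = False             # vote_to_halt()

    def run(vertices):
        inbox = {v: [] for v in vertices}
        superstep = 0
        while True:
            superstep += 1
            # one superstep: compute on every vertex with pending work
            for vid, vertex in vertices.items():
                if vertex.active or inbox[vid]:
                    vertex.compute(inbox[vid])
            # the barrier: collect all messages for the next superstep
            new_inbox = {v: [] for v in vertices}
            busy = False
            for vertex in vertices.values():
                for nbr, msg in vertex.outbox:
                    new_inbox[nbr].append(msg)
                    busy = True
                vertex.outbox = []
            inbox = new_inbox
            if not busy:                    # all halted, no mail: done
                break
        print('finished after %d supersteps' % superstep)

    # Hypothetical weights; vertex names follow the slides' A-G labels.
    edges = {'A': {'B': 1, 'C': 5}, 'B': {'C': 2, 'D': 4},
             'C': {'D': 3}, 'D': {}}
    vertices = {v: ShortestPathVertex(v, e, v == 'A')
                for v, e in edges.items()}
    run(vertices)
    print({v: x.value for v, x in vertices.items()})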
