Processing Over a Billion Edges
on Apache Giraph
Hadoop Summit 2012



Avery Ching
Software Engineer
6/14/2012
Agenda

1   Motivation and Background

2   Giraph Concepts/API

3   Example Applications

4   Architecture Overview

5   Recent/Future Improvements
What is Apache Giraph?

•  Loose implementation of Google’s Pregel that runs
   as a map-only job on Hadoop

•  “Think like a vertex” that can send messages to any
   other vertex in the graph using the bulk synchronous
   parallel programming model

•  An in-memory scalable system*
 ▪    Will be enhanced with out-of-core messages/vertices to handle
      larger problem sets.
What (social) graphs are we targeting?

•  3/2012 LinkedIn has 161 million users

•  6/2012 Twitter discloses 140 million MAU

•  4/2012 Facebook declares 901 million MAU
Example applications

•  Ranking
 ▪    Popularity, importance, etc.

•  Label Propagation
 ▪    Location, school, gender, etc.

•  Community
 ▪    Groups, interests
Bulk synchronous parallel

•  Supersteps
 ▪    A global epoch followed by a global barrier where components
      do concurrent computation and send messages

•  Point-to-point messages (i.e. vertex to vertex)
 ▪    Sent during a superstep from one component to another and
      then delivered in the following superstep

•  Computation complete when all components
   complete
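[Figure: bulk synchronous parallel timeline. Processors alternate computation and communication within each superstep; supersteps are separated by barriers along the time axis.]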
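The superstep and barrier cycle can be sketched in plain Java with no Giraph or Hadoop dependency. The class `BspSketch` and its methods are hypothetical names chosen for this illustration. The sketch only shows the delivery rule from the bullets above: a message sent during one superstep sits in an outbox and becomes visible to its receiver in the following superstep.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not Giraph code: a token travels down a chain of
// components, one hop per superstep.
final class BspSketch {

    /** Returns, per component, the superstep in which it saw the token (-1 if never). */
    static int[] run(int n) {
        int[] seenAt = new int[n];
        Arrays.fill(seenAt, -1);
        // inbox.get(i) holds the messages delivered to component i in this superstep.
        List<List<Integer>> inbox = emptyBoxes(n);
        boolean anyMessages = true;
        for (int superstep = 0; anyMessages; superstep++) {
            List<List<Integer>> outbox = emptyBoxes(n);
            anyMessages = false;
            for (int i = 0; i < n; i++) {
                // Component 0 starts with the token; others need a delivered message.
                boolean hasToken = (superstep == 0 && i == 0) || !inbox.get(i).isEmpty();
                if (!hasToken) {
                    continue;
                }
                seenAt[i] = superstep;
                if (i + 1 < n) {
                    outbox.get(i + 1).add(i); // sent now, delivered next superstep
                    anyMessages = true;
                }
            }
            // Global barrier: the outbox becomes the inbox only after every
            // component has finished this superstep.
            inbox = outbox;
        }
        return seenAt;
    }

    private static List<List<Integer>> emptyBoxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<>());
        }
        return boxes;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(4)));
    }
}
```

The loop ends once a superstep produces no messages, which is the simplest stand-in for "all components complete".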
MapReduce -> Giraph
“Think like a vertex”, not a key-value pair!

MapReduce:

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException;
}

Giraph:

public class Vertex<
    I extends WritableComparable,
    V extends Writable,
    E extends Writable,
    M extends Writable> {
  void compute(Iterator<M> msgIterator);
}
Basic Giraph API
  Methods available to compute()

Immediate effect/access:
  I getVertexId()
  V getVertexValue()
  void setVertexValue(V vertexValue)
  Iterator<I> iterator()
  E getEdgeValue(I targetVertexId)
  boolean hasEdge(I targetVertexId)
  boolean addEdge(I targetVertexId, E edgeValue)
  E removeEdge(I targetVertexId)
  void voteToHalt()
  boolean isHalted()

Next superstep:
  void sendMsg(I id, M msg)
  void sendMsgToAllEdges(M msg)
  void addVertexRequest(BasicVertex<I, V, E, M> vertex)
  void removeVertexRequest(I vertexId)
  void addEdgeRequest(I sourceVertexId, Edge<I, E> edge)
  void removeEdgeRequest(I sourceVertexId, I destVertexId)
Why not implement Giraph with multiple
MapReduce jobs?
•  Too much disk, no in-memory caching, a superstep
   becomes a job!

[Figure: MapReduce job pipeline. Input format (splits 0-3) -> map tasks -> intermediate files -> reduce tasks -> output format (outputs 0-1).]
Giraph is a single Map-only job in
Hadoop
•  Hadoop is purely a resource manager for Giraph; all
   communication is done through Netty-based IPC

[Figure: Giraph job pipeline. Vertex input format (splits 0-3) -> map tasks -> vertex output format (outputs 0-1).]
Maximum vertex value implementation
public class MaxValueVertex extends EdgeListVertex<
    IntWritable, IntWritable, IntWritable, IntWritable> {
  @Override
  public void compute(Iterator<IntWritable> msgIterator) {
    boolean changed = false;
    while (msgIterator.hasNext()) {
      IntWritable msgValue = msgIterator.next();
      if (msgValue.get() > getVertexValue().get()) {
        setVertexValue(msgValue);
        changed = true;
      }
    }
    if (getSuperstep() == 0 || changed) {
      sendMsgToAllEdges(getVertexValue());
    } else {
      voteToHalt();
    }
  }
}
Maximum vertex value


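[Figure: maximum vertex value example. Processor 1 holds a vertex with value 5; Processor 2 holds vertices with values 1 and 2. Across three barriers the values go from (5, 1, 2) to (5, 5, 2) to (5, 5, 5), followed by a final superstep in which nothing changes.]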
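The example can be reproduced with a small stand-alone simulation. This is a sketch under stated assumptions, not Giraph code: `MaxValueSketch` is a hypothetical class, the graph is assumed to be an undirected chain (each vertex linked to its left and right neighbor), and a halted vertex is assumed to wake up only when a message arrives, as in Pregel.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plain-Java simulation of the max-value algorithm on a chain graph.
final class MaxValueSketch {

    /** Propagates the maximum in place and returns the number of supersteps executed. */
    static int run(int[] values) {
        int n = values.length;
        boolean[] halted = new boolean[n];
        List<List<Integer>> inbox = emptyBoxes(n);
        int superstep = 0;
        while (true) {
            List<List<Integer>> outbox = emptyBoxes(n);
            boolean messagesSent = false;
            for (int v = 0; v < n; v++) {
                // A halted vertex is skipped unless a message reactivates it.
                if (halted[v] && inbox.get(v).isEmpty()) {
                    continue;
                }
                halted[v] = false;
                boolean changed = false;
                for (int msg : inbox.get(v)) {
                    if (msg > values[v]) {
                        values[v] = msg;
                        changed = true;
                    }
                }
                if (superstep == 0 || changed) {
                    // Equivalent of sendMsgToAllEdges on a chain: left and right neighbor.
                    if (v > 0) {
                        outbox.get(v - 1).add(values[v]);
                        messagesSent = true;
                    }
                    if (v < n - 1) {
                        outbox.get(v + 1).add(values[v]);
                        messagesSent = true;
                    }
                } else {
                    halted[v] = true; // voteToHalt()
                }
            }
            superstep++;
            boolean allHalted = true;
            for (boolean h : halted) {
                allHalted &= h;
            }
            if (allHalted && !messagesSent) {
                return superstep;
            }
            inbox = outbox; // barrier: deliver messages to the next superstep
        }
    }

    private static List<List<Integer>> emptyBoxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<>());
        }
        return boxes;
    }

    public static void main(String[] args) {
        int[] values = {5, 1, 2};
        int supersteps = run(values);
        System.out.println(java.util.Arrays.toString(values) + " after " + supersteps + " supersteps");
    }
}
```

The computation stops when every vertex has voted to halt and no messages are in flight, which matches the termination rule of the bulk synchronous parallel model described earlier.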
Page rank implementation
public class SimplePageRankVertex extends EdgeListVertex<LongWritable,
    DoubleWritable, FloatWritable, DoubleWritable> {
  @Override
  public void compute(Iterator<DoubleWritable> msgIterator) {
    if (getSuperstep() >= 1) {
      // Sum the rank contributions received from in-neighbors.
      double sum = 0;
      while (msgIterator.hasNext()) {
        sum += msgIterator.next().get();
      }
      setVertexValue(
          new DoubleWritable((0.15f / getNumVertices()) + 0.85f * sum));
    }
    if (getSuperstep() < 30) {
      // Split this vertex's rank evenly across its out-edges.
      long edges = getNumOutEdges();
      sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / edges));
    } else {
      voteToHalt();
    }
  }
}
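The update rule in the listing, rank = 0.15 / N + 0.85 * (sum of incoming contributions), can be checked outside Giraph with a few lines of plain Java. `PageRankSketch` is a hypothetical class; the initial rank of 1/N and the requirement that every vertex has at least one out-edge are assumptions of this sketch, since the slide does not state how vertex values are initialized.

```java
import java.util.Arrays;

// Hypothetical stand-alone version of the slide's PageRank update rule.
final class PageRankSketch {

    /**
     * outEdges[v] lists the targets of v's out-edges. Every vertex must have
     * at least one out-edge (assumption: no dangling vertices).
     */
    static double[] run(int[][] outEdges, int rounds) {
        int n = outEdges.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // assumed initial value
        for (int round = 0; round < rounds; round++) {
            double[] sum = new double[n];
            // "Send" rank / outDegree along every out-edge.
            for (int v = 0; v < n; v++) {
                for (int target : outEdges[v]) {
                    sum[target] += rank[v] / outEdges[v].length;
                }
            }
            // "Receive" in the next superstep and apply the damping formula.
            for (int v = 0; v < n; v++) {
                rank[v] = 0.15 / n + 0.85 * sum[v];
            }
        }
        return rank;
    }

    public static void main(String[] args) {
        int[][] graph = {{1, 2}, {2}, {0}}; // 0->1, 0->2, 1->2, 2->0
        System.out.println(Arrays.toString(run(graph, 30)));
    }
}
```

Each round corresponds to one send in superstep s and one update in superstep s + 1, so 30 rounds match the 30 updates performed by the vertex program above.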
Giraph In MapReduce
Giraph components
•  Master – Application coordinator
 ▪    One active master at a time
 ▪    Assigns partition owners to workers prior to each superstep
 ▪    Synchronizes supersteps

•  Worker – Computation & messaging
 ▪    Loads the graph from input splits
 ▪    Does the computation/messaging of its assigned partitions

•  ZooKeeper
 ▪    Maintains global application state
Graph distribution
•  Master graph partitioner
 ▪    Create initial partitions, generate partition owner changes
      between supersteps

•  Worker graph partitioner
 ▪    Determine which partition a vertex belongs to
 ▪    Create/modify the partition stats (can split/merge partitions)

•  Default is hash partitioning (hashCode())
 ▪    Range-based partitioning is also possible on a per-type basis
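A minimal sketch of hash partitioning, assuming the partition is derived from the vertex id's `hashCode()` modulo the partition count. `HashPartitionSketch`, `partitionOf`, and `workerOf` are hypothetical names; the contiguous partition-to-worker layout is also an assumption, chosen only so that eight partitions spread over four workers in pairs. In Giraph the master decides partition ownership.

```java
// Hypothetical illustration of hash partitioning; not the Giraph partitioner classes.
final class HashPartitionSketch {

    /** Maps a vertex id to a partition in [0, partitionCount). */
    static int partitionOf(Object vertexId, int partitionCount) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(vertexId.hashCode(), partitionCount);
    }

    /** Assigns partitions to workers in contiguous blocks (assumed layout). */
    static int workerOf(int partition, int partitionCount, int workerCount) {
        return (int) ((long) partition * workerCount / partitionCount);
    }

    public static void main(String[] args) {
        int partitions = 8;
        int workers = 4;
        for (long id = 0; id < 10; id++) {
            int p = partitionOf(Long.valueOf(id), partitions);
            System.out.println("vertex " + id + " -> partition " + p
                + " -> worker " + workerOf(p, partitions, workers));
        }
    }
}
```

Hashing needs no knowledge of the id distribution, which is why it is a convenient default; range-based partitioning would replace `partitionOf` with a lookup over sorted id boundaries.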
Graph distribution example


[Figure: graph distribution example. The master assigns eight partitions to four workers, two per worker (partitions 0-1 to Worker 0, 2-3 to Worker 1, 4-5 to Worker 2, 6-7 to Worker 3). Each worker handles load/store, compute, and messages for its partitions and produces one stats record per partition (Stats 0-7).]
Customizable fault tolerance
•  No single point of failure from Giraph threads
 ▪    With multiple master threads, if the current master dies, a new one will automatically
      take over.
 ▪    If a worker thread dies, the application is rolled back to a previously checkpointed
      superstep. The next superstep will begin with the new number of workers.
 ▪    If a zookeeper server dies, as long as a quorum remains, the application can proceed

•  Hadoop single points of failure still exist
 ▪    Namenode, jobtracker

 ▪    Restarting manually from a checkpoint is always possible




Master thread fault tolerance
[Figure: master failover. Before the failure, Master 0 is "active" and Masters 1 and 2 are "spare". After Master 0 fails, Master 1 becomes "active" and Master 2 remains "spare". In both panels the active master state is kept in a separate shared store.]

•  One active master, with spare masters taking over in the event of an active master
   failure

•  All active master state is stored in ZooKeeper so that a spare master can
   immediately step in when an active master fails

•  “Active” master implemented as a queue in ZooKeeper
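The queue idea can be illustrated with an in-memory stand-in. This sketch does not use the ZooKeeper API and is not how Giraph implements election; `MasterElectionSketch` and its methods are hypothetical. It only shows the rule implied by the bullets: candidates line up in arrival order, the head of the queue is the active master, and removing a failed master promotes the next candidate.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical in-memory model of a master election queue (no ZooKeeper involved).
final class MasterElectionSketch {
    private final Deque<String> candidates = new ArrayDeque<>();

    /** A master joins at the tail of the queue when it starts. */
    void register(String masterId) {
        candidates.addLast(masterId);
    }

    /** The head of the queue is the active master; null if none is registered. */
    String active() {
        return candidates.peekFirst();
    }

    /** A failed master disappears from the queue, wherever it was. */
    void fail(String masterId) {
        candidates.remove(masterId);
    }

    public static void main(String[] args) {
        MasterElectionSketch election = new MasterElectionSketch();
        election.register("master-0");
        election.register("master-1");
        election.register("master-2");
        System.out.println("active: " + election.active());
        election.fail("master-0");
        System.out.println("active after failure: " + election.active());
    }
}
```

Because the application state lives outside the masters, the promoted spare needs nothing from the failed master beyond its place in the queue.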

Worker thread fault tolerance
[Figure: worker failure timelines. First run: superstep i (no checkpoint), superstep i+1 (checkpoint), superstep i+2 (no checkpoint), then a worker failure. Second run restarts at superstep i+1 (checkpoint), continues through i+2 (no checkpoint) and i+3 (checkpoint), then a worker fails after the checkpoint completes. Third run restarts at superstep i+3 (no checkpoint) and runs until the application completes.]
•  A single worker death fails the superstep

•  Application reverts to the last committed superstep automatically
 ▪    Master detects worker failure during any superstep with a ZooKeeper “health”
      znode
 ▪    Master chooses the last committed superstep and sends a command through
      ZooKeeper for all workers to restart from that superstep
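The restart rule can be expressed as a lookup of the latest committed checkpoint at or before the failed superstep. This is a hedged sketch: `CheckpointSketch` is a hypothetical class, and returning -1 when no checkpoint exists is an assumption of the example, since the slides do not say what happens in that case.

```java
import java.util.TreeSet;

// Hypothetical model of "revert to the last committed superstep".
final class CheckpointSketch {
    private final TreeSet<Long> committed = new TreeSet<>();

    /** Records that the checkpoint for this superstep was fully written. */
    void commit(long superstep) {
        committed.add(superstep);
    }

    /**
     * Returns the superstep all workers should restart from after a failure,
     * or -1 if no checkpoint has been committed yet (assumed convention).
     */
    long restartFrom(long failedSuperstep) {
        Long latest = committed.floor(failedSuperstep);
        return latest == null ? -1L : latest;
    }

    public static void main(String[] args) {
        CheckpointSketch checkpoints = new CheckpointSketch();
        checkpoints.commit(1);
        System.out.println("failure in superstep 2, restart from " + checkpoints.restartFrom(2));
        checkpoints.commit(3);
        System.out.println("failure in superstep 3, restart from " + checkpoints.restartFrom(3));
    }
}
```

With checkpoints at every other superstep, a failure costs at most one extra superstep of recomputation; checkpointing less often trades cheaper normal operation for a longer rollback.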
Optional features
•  Combiners
 ▪    Similar to Map-Reduce combiners
 ▪    Users implement a combine() method that can reduce the
      number of messages sent and received
 ▪    Run on both the client side (memory, network) and server side
      (memory)

•  Aggregators
 ▪    Similar to MPI aggregation routines (e.g. max, min, sum, etc.)
 ▪    Commutative and associative operations that are performed
      globally
 ▪    Examples include global communication, monitoring, and
      statistics
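Both features reduce to folding values with an operation whose result does not depend on order. The sketch below is plain Java with hypothetical names (`CombineAggregateSketch`, `combineMax`, `aggregate`); it does not reproduce the Giraph combiner or aggregator interfaces. It shows a max combiner collapsing the messages bound for one vertex into a single message, and a generic aggregator folding per-vertex contributions into one global value.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.BinaryOperator;

// Hypothetical illustration of combiners and aggregators; not the Giraph interfaces.
final class CombineAggregateSketch {

    /**
     * Combiner for a max-value computation: only the largest message matters,
     * so many messages to one vertex shrink to one. Requires a non-empty list.
     */
    static List<Integer> combineMax(List<Integer> messagesForOneVertex) {
        return Collections.singletonList(Collections.max(messagesForOneVertex));
    }

    /**
     * Aggregator: folds contributions with a commutative, associative operation,
     * so the order in which workers report does not change the result.
     */
    static <T> T aggregate(List<T> contributions, T identity, BinaryOperator<T> op) {
        T result = identity;
        for (T contribution : contributions) {
            result = op.apply(result, contribution);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(combineMax(Arrays.asList(3, 9, 4)));
        System.out.println(aggregate(Arrays.asList(1L, 2L, 3L), 0L, Long::sum));
    }
}
```

The combiner is safe for the maximum-value example because dropping smaller messages cannot change the receiving vertex's result; an algorithm that needs every individual message could not use it.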
Recent Netty IPC implementation
•  Big improvement over the Hadoop RPC implementation

•  10-39% overall performance improvement

•  Still need more Netty tuning

[Figure: run time in seconds (left axis, 0-300) and % improvement (right axis, 0-50) for Netty versus Hadoop RPC at 10, 30, and 50 workers.]
Recent benchmarks
•  Test cluster of 80 machines
 ▪    Facebook Hadoop (https://github.com/facebook/hadoop-20)
 ▪    72 cores, 64+ GB of memory

•  org.apache.giraph.benchmark.PageRankBenchmark
 ▪    5 supersteps
 ▪    No checkpointing
 ▪    10 edges per vertex
Worker scalability
[Figure: run time in seconds (0-3000) versus number of workers (10, 20, 30, 40, 45, 50).]
Edge Scalability
[Figure: run time in seconds (0-5000) versus number of edges (1-5 billion).]
Worker / edge scalability
[Figure: run time in seconds (left axis, 0-2000) and edges in billions (right axis, 0-8) versus number of workers (10, 30, 50). Series: Run Time, Workers/Edges.]
Apache Giraph has graduated as of
5/2012
•  Incubated for less than a year (entered incubator
   9/12)

•  Committers from HortonWorks, Twitter, LinkedIn,
   Facebook, TrendMicro and various schools (VU
   Amsterdam, TU Berlin, Korea University)

•  Released 0.1 on 2/6/2012; release 0.2 is expected
   within a few months
Future improvements
•  Out-of-core messages/graph
 ▪    Under memory pressure, dump messages/portions of the graph
      to local disk
 ▪    Ability to run applications without having all needed memory

•  Performance improvements
 ▪    Netty is a good step in the right direction, but need to tune
      messaging performance as it takes up a majority of the time
 ▪    Scale back use of ZooKeeper to only be for health registration,
      rather than implementing aggregators and coordination
More future improvements
•  Adding a master#compute() method
 ▪    Arbitrary master computation that sends results to workers prior
      to a superstep to simplify certain algorithms
 ▪    GIRAPH-127

•  Handling skew
 ▪    Some vertices have a large number of edges and we need to
      break them up and handle them differently to provide better
      scalability

More Related Content

What's hot

Hadoopの概念と基本的知識
Hadoopの概念と基本的知識Hadoopの概念と基本的知識
Hadoopの概念と基本的知識Ken SASAKI
 
HDFSのスケーラビリティの限界を突破するためのさまざまな取り組み | Hadoop / Spark Conference Japan 2019 #hc...
HDFSのスケーラビリティの限界を突破するためのさまざまな取り組み | Hadoop / Spark Conference Japan 2019  #hc...HDFSのスケーラビリティの限界を突破するためのさまざまな取り組み | Hadoop / Spark Conference Japan 2019  #hc...
HDFSのスケーラビリティの限界を突破するためのさまざまな取り組み | Hadoop / Spark Conference Japan 2019 #hc...Yahoo!デベロッパーネットワーク
 
Lessons learned processing 70 billion data points a day using the hybrid cloud
Lessons learned processing 70 billion data points a day using the hybrid cloudLessons learned processing 70 billion data points a day using the hybrid cloud
Lessons learned processing 70 billion data points a day using the hybrid cloudDataWorks Summit
 
HDFSネームノードのHAについて #hcj13w
HDFSネームノードのHAについて #hcj13wHDFSネームノードのHAについて #hcj13w
HDFSネームノードのHAについて #hcj13wCloudera Japan
 
No instrumentation Golang Logging with eBPF (GoSF talk 11/11/20)
No instrumentation Golang Logging with eBPF (GoSF talk 11/11/20)No instrumentation Golang Logging with eBPF (GoSF talk 11/11/20)
No instrumentation Golang Logging with eBPF (GoSF talk 11/11/20)Zain Asgar
 
HBase スキーマ設計のポイント
HBase スキーマ設計のポイントHBase スキーマ設計のポイント
HBase スキーマ設計のポイントdaisuke-a-matsui
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Claudio Martella
 
BigtopでHadoopをビルドする(Open Source Conference 2021 Online/Spring 発表資料)
BigtopでHadoopをビルドする(Open Source Conference 2021 Online/Spring 発表資料)BigtopでHadoopをビルドする(Open Source Conference 2021 Online/Spring 発表資料)
BigtopでHadoopをビルドする(Open Source Conference 2021 Online/Spring 発表資料)NTT DATA Technology & Innovation
 
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料) 40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料) hamaken
 
知っておくべきCephのIOアクセラレーション技術とその活用方法 - OpenStack最新情報セミナー 2015年9月
知っておくべきCephのIOアクセラレーション技術とその活用方法 - OpenStack最新情報セミナー 2015年9月知っておくべきCephのIOアクセラレーション技術とその活用方法 - OpenStack最新情報セミナー 2015年9月
知っておくべきCephのIOアクセラレーション技術とその活用方法 - OpenStack最新情報セミナー 2015年9月VirtualTech Japan Inc.
 
さいきんのMySQLに関する取り組み(仮)
さいきんのMySQLに関する取り組み(仮)さいきんのMySQLに関する取り組み(仮)
さいきんのMySQLに関する取り組み(仮)Takanori Sejima
 
kubectl apply -f cloud-Infrastructure.yaml mit Crossplane et al.pdf
kubectl apply -f cloud-Infrastructure.yaml mit Crossplane et al.pdfkubectl apply -f cloud-Infrastructure.yaml mit Crossplane et al.pdf
kubectl apply -f cloud-Infrastructure.yaml mit Crossplane et al.pdfQAware GmbH
 
Apache Hadoop YARNとマルチテナントにおけるリソース管理
Apache Hadoop YARNとマルチテナントにおけるリソース管理Apache Hadoop YARNとマルチテナントにおけるリソース管理
Apache Hadoop YARNとマルチテナントにおけるリソース管理Cloudera Japan
 
Facebook architecture presentation: scalability challenge
Facebook architecture presentation: scalability challengeFacebook architecture presentation: scalability challenge
Facebook architecture presentation: scalability challengeCristina Munoz
 
MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentJean-François Gagné
 
Kubernetesのしくみ やさしく学ぶ 内部構造とアーキテクチャー
Kubernetesのしくみ やさしく学ぶ 内部構造とアーキテクチャーKubernetesのしくみ やさしく学ぶ 内部構造とアーキテクチャー
Kubernetesのしくみ やさしく学ぶ 内部構造とアーキテクチャーToru Makabe
 
ARM LinuxのMMUはわかりにくい
ARM LinuxのMMUはわかりにくいARM LinuxのMMUはわかりにくい
ARM LinuxのMMUはわかりにくいwata2ki
 
ネットワーク運用自動化の実際〜現場で使われているツールを調査してみた〜
ネットワーク運用自動化の実際〜現場で使われているツールを調査してみた〜ネットワーク運用自動化の実際〜現場で使われているツールを調査してみた〜
ネットワーク運用自動化の実際〜現場で使われているツールを調査してみた〜Taiji Tsuchiya
 

What's hot (20)

Hadoopの概念と基本的知識
Hadoopの概念と基本的知識Hadoopの概念と基本的知識
Hadoopの概念と基本的知識
 
HDFSのスケーラビリティの限界を突破するためのさまざまな取り組み | Hadoop / Spark Conference Japan 2019 #hc...
HDFSのスケーラビリティの限界を突破するためのさまざまな取り組み | Hadoop / Spark Conference Japan 2019  #hc...HDFSのスケーラビリティの限界を突破するためのさまざまな取り組み | Hadoop / Spark Conference Japan 2019  #hc...
HDFSのスケーラビリティの限界を突破するためのさまざまな取り組み | Hadoop / Spark Conference Japan 2019 #hc...
 
Apache Hadoopの新機能Ozoneの現状
Apache Hadoopの新機能Ozoneの現状Apache Hadoopの新機能Ozoneの現状
Apache Hadoopの新機能Ozoneの現状
 
Lessons learned processing 70 billion data points a day using the hybrid cloud
Lessons learned processing 70 billion data points a day using the hybrid cloudLessons learned processing 70 billion data points a day using the hybrid cloud
Lessons learned processing 70 billion data points a day using the hybrid cloud
 
HDFSネームノードのHAについて #hcj13w
HDFSネームノードのHAについて #hcj13wHDFSネームノードのHAについて #hcj13w
HDFSネームノードのHAについて #hcj13w
 
No instrumentation Golang Logging with eBPF (GoSF talk 11/11/20)
No instrumentation Golang Logging with eBPF (GoSF talk 11/11/20)No instrumentation Golang Logging with eBPF (GoSF talk 11/11/20)
No instrumentation Golang Logging with eBPF (GoSF talk 11/11/20)
 
HBase スキーマ設計のポイント
HBase スキーマ設計のポイントHBase スキーマ設計のポイント
HBase スキーマ設計のポイント
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014
 
BigtopでHadoopをビルドする(Open Source Conference 2021 Online/Spring 発表資料)
BigtopでHadoopをビルドする(Open Source Conference 2021 Online/Spring 発表資料)BigtopでHadoopをビルドする(Open Source Conference 2021 Online/Spring 発表資料)
BigtopでHadoopをビルドする(Open Source Conference 2021 Online/Spring 発表資料)
 
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料) 40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
 
知っておくべきCephのIOアクセラレーション技術とその活用方法 - OpenStack最新情報セミナー 2015年9月
知っておくべきCephのIOアクセラレーション技術とその活用方法 - OpenStack最新情報セミナー 2015年9月知っておくべきCephのIOアクセラレーション技術とその活用方法 - OpenStack最新情報セミナー 2015年9月
知っておくべきCephのIOアクセラレーション技術とその活用方法 - OpenStack最新情報セミナー 2015年9月
 
さいきんのMySQLに関する取り組み(仮)
さいきんのMySQLに関する取り組み(仮)さいきんのMySQLに関する取り組み(仮)
さいきんのMySQLに関する取り組み(仮)
 
kubectl apply -f cloud-Infrastructure.yaml mit Crossplane et al.pdf
kubectl apply -f cloud-Infrastructure.yaml mit Crossplane et al.pdfkubectl apply -f cloud-Infrastructure.yaml mit Crossplane et al.pdf
kubectl apply -f cloud-Infrastructure.yaml mit Crossplane et al.pdf
 
Apache Hadoop YARNとマルチテナントにおけるリソース管理
Apache Hadoop YARNとマルチテナントにおけるリソース管理Apache Hadoop YARNとマルチテナントにおけるリソース管理
Apache Hadoop YARNとマルチテナントにおけるリソース管理
 
Facebook architecture presentation: scalability challenge
Facebook architecture presentation: scalability challengeFacebook architecture presentation: scalability challenge
Facebook architecture presentation: scalability challenge
 
MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated Environment
 
Kubernetesのしくみ やさしく学ぶ 内部構造とアーキテクチャー
Kubernetesのしくみ やさしく学ぶ 内部構造とアーキテクチャーKubernetesのしくみ やさしく学ぶ 内部構造とアーキテクチャー
Kubernetesのしくみ やさしく学ぶ 内部構造とアーキテクチャー
 
ARM LinuxのMMUはわかりにくい
ARM LinuxのMMUはわかりにくいARM LinuxのMMUはわかりにくい
ARM LinuxのMMUはわかりにくい
 
Hadoop
HadoopHadoop
Hadoop
 
ネットワーク運用自動化の実際〜現場で使われているツールを調査してみた〜
ネットワーク運用自動化の実際〜現場で使われているツールを調査してみた〜ネットワーク運用自動化の実際〜現場で使われているツールを調査してみた〜
ネットワーク運用自動化の実際〜現場で使われているツールを調査してみた〜
 

Viewers also liked

2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - HortonworksAvery Ching
 
Tokyo nlp #8 label propagation
Tokyo nlp #8 label propagationTokyo nlp #8 label propagation
Tokyo nlp #8 label propagationYo Ehara
 
Core Messages in Job Hunting
Core Messages in Job HuntingCore Messages in Job Hunting
Core Messages in Job HuntingChrisSteed
 
Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangaloreappaji intelhunt
 
GraphXはScalaエンジニアにとってのブルーオーシャン @ Scala Matsuri 2014
GraphXはScalaエンジニアにとってのブルーオーシャン @ Scala Matsuri 2014GraphXはScalaエンジニアにとってのブルーオーシャン @ Scala Matsuri 2014
GraphXはScalaエンジニアにとってのブルーオーシャン @ Scala Matsuri 2014鉄平 土佐
 
Deploying WebRTC successfully – A web developer perspective
Deploying WebRTC successfully – A web developer perspectiveDeploying WebRTC successfully – A web developer perspective
Deploying WebRTC successfully – A web developer perspectiveDialogic Inc.
 
Fast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARNFast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARNDataWorks Summit
 
Introduction of apache giraph project
Introduction of apache giraph projectIntroduction of apache giraph project
Introduction of apache giraph projectChun Cheng Lin
 
大規模グラフデータ処理
大規模グラフデータ処理大規模グラフデータ処理
大規模グラフデータ処理maruyama097
 
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...rhatr
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraphsscdotopen
 
Netty Notes Part 2 - Transports and Buffers
Netty Notes Part 2 - Transports and BuffersNetty Notes Part 2 - Transports and Buffers
Netty Notes Part 2 - Transports and BuffersRick Hightower
 
Initiation à Neo4j
Initiation à Neo4jInitiation à Neo4j
Initiation à Neo4jNeo4j
 
Spark Streaming と Spark GraphX を使用したTwitter解析による レコメンドサービス例
Spark Streaming と Spark GraphX を使用したTwitter解析による レコメンドサービス例Spark Streaming と Spark GraphX を使用したTwitter解析による レコメンドサービス例
Spark Streaming と Spark GraphX を使用したTwitter解析による レコメンドサービス例Junichi Noda
 
Spark GraphX で始めるグラフ解析
Spark GraphX で始めるグラフ解析Spark GraphX で始めるグラフ解析
Spark GraphX で始めるグラフ解析Yosuke Mizutani
 
Building day 2 upload Building the Internet of Things with Thingsquare and ...
Building day 2   upload Building the Internet of Things with Thingsquare and ...Building day 2   upload Building the Internet of Things with Thingsquare and ...
Building day 2 upload Building the Internet of Things with Thingsquare and ...Adam Dunkels
 
SQL/NoSQL How to choose ?
SQL/NoSQL How to choose ?SQL/NoSQL How to choose ?
SQL/NoSQL How to choose ?Venu Anuganti
 

Viewers also liked (20)

2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks
 
Tokyo nlp #8 label propagation
Tokyo nlp #8 label propagationTokyo nlp #8 label propagation
Tokyo nlp #8 label propagation
 
Core Messages in Job Hunting
Core Messages in Job HuntingCore Messages in Job Hunting
Core Messages in Job Hunting
 
GETTING YOUR MESSAGE RIGHT
GETTING YOUR MESSAGE RIGHTGETTING YOUR MESSAGE RIGHT
GETTING YOUR MESSAGE RIGHT
 
Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangalore
 
GraphXはScalaエンジニアにとってのブルーオーシャン @ Scala Matsuri 2014
GraphXはScalaエンジニアにとってのブルーオーシャン @ Scala Matsuri 2014GraphXはScalaエンジニアにとってのブルーオーシャン @ Scala Matsuri 2014
GraphXはScalaエンジニアにとってのブルーオーシャン @ Scala Matsuri 2014
 
Deploying WebRTC successfully – A web developer perspective
Deploying WebRTC successfully – A web developer perspectiveDeploying WebRTC successfully – A web developer perspective
Deploying WebRTC successfully – A web developer perspective
 
Fast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARNFast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARN
 
Apache giraph
Apache giraphApache giraph
Apache giraph
 
Introduction of apache giraph project
Introduction of apache giraph projectIntroduction of apache giraph project
Introduction of apache giraph project
 
大規模グラフデータ処理
大規模グラフデータ処理大規模グラフデータ処理
大規模グラフデータ処理
 
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph
 
Netty Notes Part 2 - Transports and Buffers
Netty Notes Part 2 - Transports and BuffersNetty Notes Part 2 - Transports and Buffers
Netty Notes Part 2 - Transports and Buffers
 
Initiation à Neo4j
Initiation à Neo4jInitiation à Neo4j
Initiation à Neo4j
 
Spark Streaming と Spark GraphX を使用したTwitter解析による レコメンドサービス例
Spark Streaming と Spark GraphX を使用したTwitter解析による レコメンドサービス例Spark Streaming と Spark GraphX を使用したTwitter解析による レコメンドサービス例
Spark Streaming と Spark GraphX を使用したTwitter解析による レコメンドサービス例
 
Spark GraphX で始めるグラフ解析
Spark GraphX で始めるグラフ解析Spark GraphX で始めるグラフ解析
Spark GraphX で始めるグラフ解析
 
Building day 2 upload Building the Internet of Things with Thingsquare and ...
Building day 2   upload Building the Internet of Things with Thingsquare and ...Building day 2   upload Building the Internet of Things with Thingsquare and ...
Building day 2 upload Building the Internet of Things with Thingsquare and ...
 
Best Practices to Build a Multichannel Campaign Plan
Best Practices to Build a Multichannel Campaign Plan Best Practices to Build a Multichannel Campaign Plan
Best Practices to Build a Multichannel Campaign Plan
 
SQL/NoSQL How to choose ?
SQL/NoSQL How to choose ?SQL/NoSQL How to choose ?
SQL/NoSQL How to choose ?
 

Similar to Processing edges on apache giraph

2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users Group2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users GroupNitay Joffe
 
2013 06-03 berlin buzzwords
2013 06-03 berlin buzzwords2013 06-03 berlin buzzwords
2013 06-03 berlin buzzwordsNitay Joffe
 
Mining quasi bicliques using giraph
Mining quasi bicliques using giraphMining quasi bicliques using giraph
Mining quasi bicliques using giraphHsiao-Fei Liu
 
Java Review
Java ReviewJava Review
Java Reviewpdgeorge
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processingsscdotopen
 
Graph processing
Graph processingGraph processing
Graph processingyeahjs
 
Apache Flink & Graph Processing
Apache Flink & Graph ProcessingApache Flink & Graph Processing
Apache Flink & Graph ProcessingVasia Kalavri
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
Dynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDataWorks Summit
 
EuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopEuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopMax Tepkeev
 
OpenMI Developers Training
OpenMI Developers TrainingOpenMI Developers Training
OpenMI Developers TrainingJan Gregersen
 
OpenMI Developers Training
OpenMI Developers TrainingOpenMI Developers Training
OpenMI Developers TrainingJan Gregersen
 
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15Vasia Kalavri
 
Introduction to hadoop
Introduction to hadoopIntroduction to hadoop
Introduction to hadoopRon Sher
 
XPDS16: Xen Live Patching - Updating Xen Without Rebooting - Konrad Wilk, Ora...
XPDS16: Xen Live Patching - Updating Xen Without Rebooting - Konrad Wilk, Ora...XPDS16: Xen Live Patching - Updating Xen Without Rebooting - Konrad Wilk, Ora...
XPDS16: Xen Live Patching - Updating Xen Without Rebooting - Konrad Wilk, Ora...The Linux Foundation
 

Processing edges on apache giraph

  • 1. Processing Over a Billion Edges on Apache Giraph Hadoop Summit 2012 Avery Ching Software Engineer 6/14/2012
  • 2. Agenda 1 Motivation and Background 2 Giraph Concepts/API 3 Example Applications 4 Architecture Overview 5 Recent/Future Improvements
  • 3. What is Apache Giraph? •  Loose implementation of Google’s Pregel that runs as a map-only job on Hadoop •  “Think like a vertex” that can send messages to any other vertex in the graph using the bulk synchronous parallel programming model •  An in-memory scalable system* ▪  Will be enhanced with out-of-core messages/vertices to handle larger problem sets.
  • 4. What (social) graphs are we targeting? •  3/2012 LinkedIn has 161 million users •  6/2012 Twitter discloses 140 million MAU •  4/2012 Facebook declares 901 million MAU
  • 5. Example applications •  Ranking ▪  Popularity, importance, etc. •  Label Propagation ▪  Location, school, gender, etc. •  Community ▪  Groups, interests
  • 6. Bulk synchronous parallel •  Supersteps ▪  A global epoch followed by a global barrier where components do concurrent computation and send messages •  Point-to-point messages (i.e. vertex to vertex) ▪  Sent during a superstep from one component to another and then delivered in the following superstep •  Computation complete when all components complete
  • 7. [Figure: processors alternate computation + communication phases over time; each superstep ends at a global barrier]
  • 8. MapReduce -> Giraph: “Think like a vertex”, not a key-value pair!
      MapReduce:
        public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
          void map(KEYIN key, VALUEIN value, Context context)
              throws IOException, InterruptedException;
        }
      Giraph:
        public class Vertex<
            I extends WritableComparable,
            V extends Writable,
            E extends Writable,
            M extends Writable> {
          void compute(Iterator<M> msgIterator);
        }
  • 9. Basic Giraph API: methods available to compute()
      Immediate effect/access:
        I getVertexId()
        V getVertexValue()
        void setVertexValue(V vertexValue)
        Iterator<I> iterator()
        E getEdgeValue(I targetVertexId)
        boolean hasEdge(I targetVertexId)
        boolean addEdge(I targetVertexId, E edgeValue)
        E removeEdge(I targetVertexId)
        void voteToHalt()
        boolean isHalted()
      Next superstep:
        void sendMsg(I id, M msg)
        void sendMsgToAllEdges(M msg)
        void addVertexRequest(BasicVertex<I, V, E, M> vertex)
        void removeVertexRequest(I vertexId)
        void addEdgeRequest(I sourceVertexId, Edge<I, E> edge)
        void removeEdgeRequest(I sourceVertexId, I destVertexId)
  • 10. Why not implement Giraph with multiple MapReduce jobs? •  Too much disk, no in-memory caching, a superstep becomes a job! [Figure: input format (splits 0 to 3) -> map tasks -> intermediate files -> reduce tasks -> output format (outputs 0 and 1)]
  • 11. Giraph is a single Map-only job in Hadoop •  Hadoop is purely a resource manager for Giraph; all communication is done through Netty-based IPC [Figure: vertex input format (splits 0 to 3) -> map tasks -> vertex output format (outputs 0 and 1)]
  • 12. Maximum vertex value implementation
      public class MaxValueVertex extends EdgeListVertex<
          IntWritable, IntWritable, IntWritable, IntWritable> {
        @Override
        public void compute(Iterator<IntWritable> msgIterator) {
          boolean changed = false;
          while (msgIterator.hasNext()) {
            IntWritable msgValue = msgIterator.next();
            if (msgValue.get() > getVertexValue().get()) {
              setVertexValue(msgValue);
              changed = true;
            }
          }
          if (getSuperstep() == 0 || changed) {
            sendMsgToAllEdges(getVertexValue());
          } else {
            voteToHalt();
          }
        }
      }
  • 13. Maximum vertex value [Figure: vertices on Processor 1 and Processor 2 start with the values 5, 1, and 2; across three barriers over time, every vertex converges to the maximum value 5]
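The superstep, barrier, and vote-to-halt behaviour of the maximum-value example can be tried without a cluster. The following is a minimal sketch in plain Java, not Giraph code: the class name `MaxValueSimulation`, the array-based graph layout, and the rule that a halted vertex wakes up when a message arrives are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java simulation of the maximum-value computation; it does not use the Giraph API. */
final class MaxValueSimulation {

    /**
     * values[v] is the initial value of vertex v and edges[v] lists its out-edge targets.
     * Updates values in place and returns the number of supersteps that ran.
     */
    static int run(int[] values, int[][] edges) {
        int n = values.length;
        List<List<Integer>> inbox = emptyInboxes(n);
        boolean[] halted = new boolean[n];
        int superstep = 0;
        while (true) {
            List<List<Integer>> nextInbox = emptyInboxes(n);
            boolean anyActive = false;
            for (int v = 0; v < n; v++) {
                List<Integer> messages = inbox.get(v);
                if (halted[v] && messages.isEmpty()) {
                    continue; // stays halted until a message arrives
                }
                halted[v] = false;
                anyActive = true;
                boolean changed = false;
                for (int msg : messages) {
                    if (msg > values[v]) {
                        values[v] = msg;
                        changed = true;
                    }
                }
                if (superstep == 0 || changed) {
                    // Same role as sendMsgToAllEdges(getVertexValue()).
                    for (int target : edges[v]) {
                        nextInbox.get(target).add(values[v]);
                    }
                } else {
                    halted[v] = true; // same role as voteToHalt()
                }
            }
            if (!anyActive) {
                return superstep; // every vertex halted and no messages in flight
            }
            inbox = nextInbox; // barrier: messages become visible in the next superstep
            superstep++;
        }
    }

    private static List<List<Integer>> emptyInboxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<>());
        }
        return boxes;
    }
}
```

Calling `MaxValueSimulation.run` on three vertices with values 5, 1, and 2 connected in a bidirectional chain leaves every vertex holding 5. Swapping the two inbox lists at the end of each loop iteration is what plays the part of the barrier.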
  • 14. Page rank implementation
      public class SimplePageRankVertex extends EdgeListVertex<
          LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
        public void compute(Iterator<DoubleWritable> msgIterator) {
          if (getSuperstep() >= 1) {
            double sum = 0;
            while (msgIterator.hasNext()) {
              sum += msgIterator.next().get();
            }
            setVertexValue(
                new DoubleWritable((0.15f / getNumVertices()) + 0.85f * sum));
          }
          if (getSuperstep() < 30) {
            long edges = getNumOutEdges();
            sendMsgToAllEdges(
                new DoubleWritable(getVertexValue().get() / edges));
          } else {
            voteToHalt();
          }
        }
      }
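The PageRank update on this slide, 0.15 / N plus 0.85 times the sum of incoming messages, can be checked on a small graph with a plain-Java sketch. Everything below is illustrative rather than Giraph code: the class name `PageRankSimulation` is hypothetical, the initial rank of 1/N is an assumption (the slide takes the initial value from the input), and vertices without out-edges simply send nothing.

```java
import java.util.Arrays;

/** Plain-Java PageRank in superstep style; it does not use the Giraph API. */
final class PageRankSimulation {

    /**
     * edges[v] lists the out-edge targets of vertex v.
     * Runs supersteps 0..maxSupersteps and returns the final rank of every vertex.
     */
    static double[] run(int[][] edges, int maxSupersteps) {
        int n = edges.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // assumed initial value
        double[] inbox = new double[n]; // sum of the messages delivered to each vertex
        for (int superstep = 0; superstep <= maxSupersteps; superstep++) {
            double[] nextInbox = new double[n];
            for (int v = 0; v < n; v++) {
                if (superstep >= 1) {
                    rank[v] = 0.15 / n + 0.85 * inbox[v];
                }
                if (superstep < maxSupersteps && edges[v].length > 0) {
                    // Each out-edge receives an equal share of the current rank.
                    double share = rank[v] / edges[v].length;
                    for (int target : edges[v]) {
                        nextInbox[target] += share;
                    }
                }
            }
            inbox = nextInbox; // barrier between supersteps
        }
        return rank;
    }
}
```

On a three-vertex cycle every vertex keeps rank 1/3, since 0.15 / 3 + 0.85 / 3 = 1/3, which makes a convenient sanity check.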
  • 16. Giraph components •  Master – Application coordinator ▪  One active master at a time ▪  Assigns partition owners to workers prior to each superstep ▪  Synchronizes supersteps •  Worker – Computation & messaging ▪  Loads the graph from input splits ▪  Does the computation/messaging of its assigned partitions •  ZooKeeper ▪  Maintains global application state
  • 17. Graph distribution •  Master graph partitioner ▪  Create initial partitions, generate partition owner changes between supersteps •  Worker graph partitioner ▪  Determine which partition a vertex belongs to ▪  Create/modify the partition stats (can split/merge partitions) •  Default is hash partitioning (hashCode()) ▪  Range-based partitioning is also possible on a per-type basis
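A hash-based assignment like the default described above can be sketched in a few lines. This is an illustration under stated assumptions, not Giraph's partitioner: the class `HashPartitioning`, both method names, and the fixed number of partitions per worker are hypothetical.

```java
/** Illustrative hash partitioning; the names here are hypothetical, not Giraph classes. */
final class HashPartitioning {

    /** Maps a vertex id to a partition using its hashCode(); floorMod keeps the result non-negative. */
    static int partitionFor(Object vertexId, int partitionCount) {
        return Math.floorMod(vertexId.hashCode(), partitionCount);
    }

    /** Assigns consecutive partitions to the same worker, e.g. partitions 0 and 1 to worker 0. */
    static int workerFor(int partition, int partitionsPerWorker) {
        return partition / partitionsPerWorker;
    }
}
```

`Math.floorMod` is used instead of `%` because `hashCode()` may be negative, and a negative remainder would not be a valid partition index.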
  • 18. Graph distribution example [Figure: the master coordinates workers 0 to 3; each worker loads/stores, computes, and exchanges messages for two partitions and keeps stats for each (worker 0: partitions 0 and 1, worker 1: partitions 2 and 3, worker 2: partitions 4 and 5, worker 3: partitions 6 and 7)]
  • 19. Customizable fault tolerance •  No single point of failure from Giraph threads ▪  With multiple master threads, if the current master dies, a new one will automatically take over. ▪  If a worker thread dies, the application is rolled back to a previously checkpointed superstep. The next superstep will begin with the new number of workers. ▪  If a ZooKeeper server dies, as long as a quorum remains, the application can proceed. •  Hadoop single points of failure still exist ▪  Namenode, jobtracker ▪  Restarting manually from a checkpoint is always possible
  • 20. Master thread fault tolerance [Figure: before the failure of active master 0, master 0 is “active” and masters 1 and 2 are “spare”; after the failure, master 1 is “active” and master 2 is “spare”; the active master state is kept in ZooKeeper] •  One active master, with spare masters taking over in the event of an active master failure •  All active master state is stored in ZooKeeper so that a spare master can immediately step in when an active master fails •  “Active” master implemented as a queue in ZooKeeper
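The queue-based takeover described above can be pictured with an in-memory stand-in. The sketch below does not talk to ZooKeeper; the class `MasterQueue` and its methods are hypothetical and only model the ordering rule that the head of the queue is the active master.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** In-memory model of a master queue; a real deployment would keep this state in ZooKeeper. */
final class MasterQueue {
    private final Deque<String> queue = new ArrayDeque<>();

    /** A master joins at the tail of the queue and waits as a spare. */
    void register(String masterId) {
        queue.addLast(masterId);
    }

    /** The master at the head of the queue is the active one; returns null if none is left. */
    String active() {
        return queue.peekFirst();
    }

    /** Removes a failed master; if it was the head, the next spare becomes active. */
    void fail(String masterId) {
        queue.remove(masterId);
    }
}
```

Registering `master0`, `master1`, and `master2` and then failing `master0` makes `master1` active, while failing a spare leaves the active master unchanged.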
  • 21. Worker thread fault tolerance [Figure: timeline. Superstep i (no checkpoint), superstep i+1 (checkpoint), superstep i+2 (no checkpoint): worker failure. Restart at superstep i+1 (checkpoint), superstep i+2 (no checkpoint), superstep i+3 (checkpoint): worker failure after checkpoint complete. Restart at superstep i+3 (no checkpoint), then application complete.] •  A single worker death fails the superstep •  Application reverts to the last committed superstep automatically ▪  Master detects worker failure during any superstep with a ZooKeeper “health” znode ▪  Master chooses the last committed superstep and sends a command through ZooKeeper for all workers to restart from that superstep
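Choosing the restart point amounts to finding the latest committed checkpoint at or before the failed superstep. A minimal sketch follows, assuming the master keeps an ordered record of committed checkpoints; the class `CheckpointLog` and its methods are hypothetical, and returning -1 to mean "no checkpoint yet" is a convention chosen here.

```java
import java.util.TreeSet;

/** Illustrative record of committed checkpoints; not part of the Giraph API. */
final class CheckpointLog {
    private final TreeSet<Long> committed = new TreeSet<>();

    /** Records that the checkpoint for the given superstep finished writing. */
    void commit(long superstep) {
        committed.add(superstep);
    }

    /** Latest committed checkpoint at or before the failed superstep, or -1 if there is none. */
    long restartFrom(long failedSuperstep) {
        Long latest = committed.floor(failedSuperstep);
        return latest == null ? -1L : latest;
    }
}
```

This matches both cases in the timeline: a failure in superstep i+2 rolls back to the checkpoint of i+1, while a failure in i+3 after its checkpoint completed restarts at i+3 itself.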
  • 22. Optional features •  Combiners ▪  Similar to MapReduce combiners ▪  Users implement a combine() method that can reduce the number of messages sent and received ▪  Run on both the client side (memory, network) and server side (memory) •  Aggregators ▪  Similar to MPI aggregation routines (e.g. max, min, sum) ▪  Commutative and associative operations that are performed globally ▪  Examples include global communication, monitoring, and statistics
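For the maximum-value example, a combiner only needs to keep the largest message per destination vertex. The sketch below illustrates that idea in plain Java; the class `MaxCombiner` and its `send` and `flush` methods are hypothetical and do not reproduce the signature of Giraph's combine() method.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sender-side buffer that combines messages per target vertex by taking the maximum. */
final class MaxCombiner {
    private final Map<Long, Integer> pending = new HashMap<>();

    /** Buffers a message, keeping only the largest value seen for each target vertex. */
    void send(long targetVertexId, int message) {
        pending.merge(targetVertexId, message, Integer::max);
    }

    /** Returns the combined messages (one per target vertex) and clears the buffer. */
    Map<Long, Integer> flush() {
        Map<Long, Integer> out = new HashMap<>(pending);
        pending.clear();
        return out;
    }
}
```

Three messages to the same vertex collapse into one, which saves memory on the sender and bytes on the network. The approach is only valid because max is commutative and associative, so the order of combination does not change the result.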
  • 23. Recent Netty IPC implementation •  Big improvement over the Hadoop RPC implementation •  10-39% overall performance improvement •  Still need more Netty tuning [Chart: time in seconds (0 to 300, left axis) and % improvement (0 to 50, right axis) vs. workers (10, 30, 50); series: Netty, Hadoop RPC, % improvement]
  • 24. Recent benchmarks •  Test cluster of 80 machines ▪  Facebook Hadoop (https://github.com/facebook/hadoop-20) ▪  72 cores, 64+ GB of memory ▪  org.apache.giraph.benchmark.PageRankBenchmark ▪  5 supersteps ▪  No checkpointing ▪  10 edges per vertex
  • 25. Worker scalability [Chart: run time in seconds (0 to 3000) vs. workers (10, 20, 30, 40, 45, 50)]
  • 26. Edge scalability [Chart: run time in seconds (0 to 5000) vs. edges in billions (1 to 5)]
  • 27. Worker / edge scalability [Chart: run time in seconds (0 to 2000, left axis) and edges in billions (0 to 8, right axis) vs. workers (10, 30, 50); series: Run Time, Workers/Edges]
  • 28. Apache Giraph has graduated as of 5/2012 •  Incubated for less than a year (entered incubator 9/12) •  Committers from HortonWorks, Twitter, LinkedIn, Facebook, TrendMicro and various schools (VU Amsterdam, TU Berlin, Korea University) •  Released 0.1 as of 2/6/2012, will release 0.2 within a few months
  • 29. Future improvements •  Out-of-core messages/graph ▪  Under memory pressure, dump messages/portions of the graph to local disk ▪  Ability to run applications without having all needed memory •  Performance improvements ▪  Netty is a good step in the right direction, but need to tune messaging performance as it takes up a majority of the time ▪  Scale back use of ZooKeeper to only be for health registration, rather than implementing aggregators and coordination
  • 30. More future improvements •  Adding a master#compute() method ▪  Arbitrary master computation that sends results to workers prior to a superstep to simplify certain algorithms ▪  GIRAPH-127 •  Handling skew ▪  Some vertices have a large number of edges and we need to break them up and handle them differently to provide better scalability
  • 31. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.
  • 32. Sessions will resume at 4:30pm