Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Zinoviev Alexey
About 
• I am a <graph theory, machine learning, traffic jam prediction, BigData algorithms> scientist
• But I'm a <Java, JavaScript, Android, NoSQL, Hadoop, Spark> programmer
BigData & Graph Theory 
Big Data of old times 
• Astronomy 
• Weather 
• Trading 
• Sea routes 
• Battles
And now ... 
• Web graph 
• Facebook friend network 
• Gmail email graph 
• EU road network 
• Citation graph 
• PayPal transaction graph
| Graph | Number of vertices | Number of edges | Volume | Data per day |
|---|---|---|---|---|
| Web graph | 1.5 * 10^12 | 1.2 * 10^13 | 100 PB | 300 TB |
| Facebook (friends graph) | 1.1 * 10^9 | 160 * 10^9 | 1 PB | 15 TB |
| Road graph of EU | 18 * 10^6 | 42 * 10^6 | 20 GB | 50 MB |
| Road graph of this city | 250 000 | 460 000 | 500 MB | 100 KB |
Problems 
• Popularity rank (page rank) 
• Determining popular users, news, jobs, etc. 
• Shortest paths 
• Max flow 
• How are users, groups connected? 
• Clustering, semi-clustering 
• Max clique, triangle closure, label propagation algorithms 
• Finding related people, groups, interests
Node Centrality Problem 
• Vertices with high impact
• Removal of important vertices reduces the reliability of the network
Cases: 
• Bioinformatics 
• Social connections 
• Road network 
• Spam detection 
• Recommendation system
Small World Problem
| Network | Average distance | Vertices | Edges |
|---|---|---|---|
| Facebook | 4.74 | 712 M | 69 G |
| Twitter | 3.67 | ---- | 5 G follows |
| MSN Messenger (1 month) | 6.6 | 180 M | 1.3 G arcs |
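Figures such as the 4.74 reported for Facebook are average shortest-path lengths between pairs of users. A minimal sketch of how that quantity is defined, assuming an unweighted graph stored as an adjacency dict (the function name and the toy graph are illustrative, not from the talk). Running a BFS from every vertex is only feasible on small graphs; at the scale of the networks above, an approximation would be needed.

```python
from collections import deque

def average_distance(adjacency):
    """Mean BFS distance over all ordered pairs of distinct, connected vertices."""
    total = 0
    pairs = 0
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adjacency[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1  # vertices reached from source, excluding itself
    return total / pairs if pairs else 0.0

# Toy example: a path a - b - c.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
avg = average_distance(path)
```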
Large graph processing tools 
Think like a vertex… 
• Majority of graph algorithms are iterative and traverse the graph in 
some way 
• Classic map-reduce overheads (job startup/shutdown, reloading data 
from HDFS, shuffling) 
• High complexity of reducing graph problems to the key-value model
• Iterative algorithms become multiple chained M/R jobs, with the full state saved and re-read at each iteration
Why not use MapReduce/Hadoop? 
• Example: PageRank, Google's famous algorithm for measuring the authority of a webpage based on the underlying network of hyperlinks
• Defined recursively: each vertex distributes its authority to its neighbors in equal proportions
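The recursive definition can be sketched as a simple iteration: in every round each vertex splits its current rank equally among its out-neighbors. This is a minimal single-machine illustration, not the MapReduce or Giraph job discussed in the talk. The damping factor 0.85, the handling of vertices without out-links, and the toy graph are assumptions added here.

```python
def pagerank(out_links, damping=0.85, iterations=30):
    """Iterative PageRank over a dict {vertex: [successor, ...]}."""
    vertices = list(out_links)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Every vertex starts the round with a small base rank (random jump).
        new_rank = {v: (1.0 - damping) / n for v in vertices}
        for v in vertices:
            targets = out_links[v]
            if targets:
                # Distribute the vertex's authority in equal proportions.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a vertex without out-links spreads its rank evenly.
                share = damping * rank[v] / n
                for t in vertices:
                    new_rank[t] += share
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

Each round needs the ranks of the previous round for the whole graph, which is why a chain of MapReduce jobs has to write and re-read the full state between iterations.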
Google Pregel 
• Distributed system especially developed for large scale graph 
processing 
• Bulk Synchronous Parallel (BSP) as execution model 
• Supersteps are atomic units of parallel computation 
• Any superstep can be restarted from a checkpoint (need not be user 
defined) 
• A new superstep provides an opportunity for rebalancing of 
components among available resources
Superstep in BSP
Vertex-centric BSP 
• Each vertex has an id, a value, a list of its adjacent vertex ids and the 
corresponding edge values 
• Each vertex is invoked in each superstep, can recompute its value and 
send messages to other vertices, which are delivered over superstep 
barriers 
• Advanced features : termination votes, combiners, aggregators, 
topology mutations
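The vertex-centric model can be illustrated with a single-process simulation. This is a sketch of the idea only, not the Pregel or Giraph API: the names `run_bsp` and `max_value` are hypothetical, and voting to halt is simplified to "a vertex that sends nothing halts until a message wakes it up". The example propagates the maximum value through the graph.

```python
from collections import defaultdict

def run_bsp(values, edges, compute, max_supersteps=50):
    """Simulate vertex-centric BSP.

    values: {vertex_id: value}, edges: {vertex_id: [neighbor_id, ...]}.
    compute(value, messages, superstep) returns (new_value, message_or_None);
    the message, if any, is sent to every neighbor.
    """
    values = dict(values)
    inbox = {v: [] for v in values}
    active = set(values)  # every vertex is active in superstep 0
    superstep = 0
    while active and superstep < max_supersteps:
        outbox = defaultdict(list)
        for v in active:
            new_value, message = compute(values[v], inbox[v], superstep)
            values[v] = new_value
            if message is not None:
                for neighbor in edges.get(v, []):
                    outbox[neighbor].append(message)
        # Barrier: messages sent in this superstep are delivered in the next one.
        inbox = {v: outbox.get(v, []) for v in values}
        # Only vertices that received a message run in the next superstep.
        active = {v for v in values if inbox[v]}
        superstep += 1
    return values, superstep

def max_value(value, messages, superstep):
    best = max([value] + messages)
    # Send in superstep 0 or when the value grew; otherwise stay silent (halt).
    if superstep == 0 or best > value:
        return best, best
    return best, None

chain_values = {1: 3, 2: 6, 3: 2, 4: 1}
chain_edges = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
result, steps = run_bsp(chain_values, chain_edges, max_value)
```

The computation stops on its own once no vertex has anything left to send, which is the role termination votes play in the real systems.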
C++ API, Pregel
Apache Giraph 
Why Apache Giraph 
Pregel is proprietary, but: 
• Apache Giraph is an open source implementation of Pregel 
• Runs on standard Hadoop infrastructure 
• Computation is executed in memory 
• Can be a job in a pipeline (MapReduce, Hive)
• Uses Apache ZooKeeper for synchronization
Why Apache Giraph 
• No locks: message-based communication 
• No semaphores: global synchronization 
• Iteration isolation: massively parallelizable
ZooKeeper in Apache Giraph 
ZooKeeper: responsible for 
computation state 
• Partition/worker mapping 
• Global state: superstep 
• Checkpoint paths, aggregator 
values, statistics
Master in Apache Giraph 
Master: responsible for coordination 
• Assigns partitions to workers 
• Coordinates synchronization 
• Requests checkpoints 
• Aggregates aggregator values 
• Collects health statuses
Worker in Apache Giraph 
Worker: responsible for vertices 
• Invokes the compute() function of active vertices
• Sends, receives and assigns 
messages 
• Computes local aggregation 
values
Scaling Giraph to a trillion edges
Fault tolerance 
No single point of failure from Giraph threads 
• With multiple master threads, if the current master dies, a new 
one will automatically take over. 
• If a worker thread dies, the application is rolled back to a 
previously checkpointed superstep. 
• If a ZooKeeper server dies, the application can proceed as long as a quorum remains
Hadoop single points of failure still exist (NameNode, JobTracker)
Worker Scalability, 250m nodes
Vertex scalability, 300 workers
Vertex/workers scalability
MapReduce vs Giraph 
6 machines with 2x8core Opteron CPUs, 4x1TB disks and 32GB RAM each, ran 1 
Giraph worker per core 
Wikipedia page link graph (6 million vertices, 200 million edges) 
PageRank on Hadoop/Mahout 
• 10 iterations approx. 29 minutes 
• average time per iteration: approx. 3 minutes 
PageRank on Giraph 
• 30 iterations took approx. 15 minutes 
• average time per iteration: approx. 30 seconds 
10x performance improvement
Okapi 
• Apache Mahout for graphs 
• Graph-based recommenders: ALS, 
SGD, SVD++, etc. 
• Graph analytics: Graph 
partitioning, Community Detection, 
K-Core, etc.
Giraph’s killer
Spark 
• MapReduce in memory 
• Up to 50x faster than Hadoop 
• Support for Shark (like Hive), MLlib 
(Machine learning), GraphX (graph 
processing) 
• RDD is a basic building block 
(immutable distributed collections of 
objects)
Spark in Hadoop old family
GraphX 
Supported algorithms
• PageRank
• Connected components
• Label propagation
• SVD++
• Strongly connected components
• Triangle count
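To make one of these algorithms concrete, here is a plain-Python sketch of what triangle counting computes. It does not use the GraphX API; the function name and the edge-list input format are assumptions chosen for illustration.

```python
def triangle_count(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    edges = list(edges)
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    # Deduplicate edges so (1, 2) and (2, 1) count once.
    unique_edges = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    count = 0
    for u, v in unique_edges:
        # Each common neighbor of u and v closes one triangle over this edge.
        count += len(neighbors[u] & neighbors[v])
    return count // 3  # every triangle is seen once per each of its 3 edges

sample = [(1, 2), (2, 3), (1, 3), (3, 4)]
triangles = triangle_count(sample)
```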
GraphChi 
• Asynchronous, disk-based version of GraphLab
• Uses the parallel sliding windows method
• Very small number of non-sequential accesses to the disk
• Graph does not fit in memory 
• Input graph is split into P disjoint intervals to balance edges, 
each associated with a shard 
• For Home deals ...
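The interval and shard layout can be sketched as follows. This is an in-memory illustration under stated assumptions, not GraphChi code: vertices are numbered 0..num_vertices-1, balancing is done on in-edge counts, and each shard holds the edges whose destination falls into its interval, sorted by source. The function name `build_shards` is hypothetical.

```python
def build_shards(edges, num_vertices, num_shards):
    """Return (intervals, shards) for a list of (source, destination) edges."""
    in_degree = [0] * num_vertices
    for _, dst in edges:
        in_degree[dst] += 1
    target = len(edges) / num_shards  # edges each interval should hold
    intervals, start, running = [], 0, 0
    for v in range(num_vertices):
        running += in_degree[v]
        # Close the interval once it holds its share; the last one stays open.
        if running >= target and len(intervals) < num_shards - 1:
            intervals.append((start, v))
            start, running = v + 1, 0
    intervals.append((start, num_vertices - 1))
    shards = [
        sorted((s, d) for s, d in edges if lo <= d <= hi)
        for lo, hi in intervals
    ]
    return intervals, shards

demo_edges = [(0, 1), (2, 1), (3, 1), (1, 2), (0, 3), (4, 3)]
demo_intervals, demo_shards = build_shards(demo_edges, num_vertices=5, num_shards=2)
```

Because every shard is sorted by source, the out-edges of one vertex interval form a contiguous block in each of the other shards, which is what allows mostly sequential disk reads.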
Road Networks 
Definition 
• Edge weights > 0 
• A few classes of roads 
• Lat/Lon attributes for each vertex 
• Subgraphs for cross-roads 
• Not as big as the web graph
• Static
Shortest path problem
AI
Full
Dijkstra
Bi-Directional
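A minimal sketch of the bidirectional variant, assuming an undirected graph with positive weights stored as nested dicts (the function name and toy graph are illustrative). Two Dijkstra searches run from the source and from the target, and the query stops once the two frontiers together can no longer improve the best path found.

```python
import heapq

def bidirectional_dijkstra(graph, source, target):
    """graph: {u: {v: weight}}, undirected. Returns the distance or inf."""
    if source == target:
        return 0.0
    inf = float("inf")
    dist = [{source: 0.0}, {target: 0.0}]      # forward and backward labels
    heaps = [[(0.0, source)], [(0.0, target)]]
    settled = [set(), set()]
    best = inf
    while heaps[0] and heaps[1]:
        # Stop when no undiscovered path can be shorter than the best one.
        if heaps[0][0][0] + heaps[1][0][0] >= best:
            break
        side = 0 if heaps[0][0][0] <= heaps[1][0][0] else 1
        d, u = heapq.heappop(heaps[side])
        if u in settled[side]:
            continue
        settled[side].add(u)
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[side].get(v, inf):
                dist[side][v] = nd
                heapq.heappush(heaps[side], (nd, v))
            # Edge (u, v) joins the two searches: candidate full path.
            if v in dist[1 - side]:
                best = min(best, nd + dist[1 - side][v])
    return best

roads = {
    "a": {"b": 1, "c": 5},
    "b": {"a": 1, "c": 2},
    "c": {"a": 5, "b": 2, "d": 1},
    "d": {"c": 1},
    "e": {},
}
a_to_d = bidirectional_dijkstra(roads, "a", "d")
```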
We need a fast system!
• Response < 10 ms (with high accuracy)
• Shortest path (SP) queries in O(n)
• Preprocessing phase
• Don't store all shortest paths: that takes O(n^2) space
• Use geo attributes
• Use compression and recoding for disk storage
• Network is stable
EU Road network 
| Algorithm | Dijkstra | ALT | RE | HH | CH | TN | HL |
|---|---|---|---|---|---|---|---|
| Query time (µs) | 2 008 300 | 24 656 | 2444 | 462.0 | 94.0 | 1.8 | 0.3 |
• ALT: [Goldberg & Harrelson 05], [Delling & Wagner 07] 
• RE: [Gutman 05], [Goldberg et al. 07] 
• HH: [Sanders & Schultes 06] 
• CH: [Geisberger et al. 08] 
• TN: [Geisberger et al. 08] 
• HL: [Abraham et al. 11]
A* with landmarks (ALT)
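A minimal sketch of the ALT idea, assuming an undirected, connected graph with positive weights. Preprocessing stores distances from a few landmarks to every vertex; at query time the triangle inequality turns them into a lower bound |d(L, t) - d(L, v)| that guides A*. The landmark choice ("a" and "f") and all names here are illustrative assumptions.

```python
import heapq

def dijkstra(graph, source):
    """Distances from source to every reachable vertex."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def alt_query(graph, landmark_dist, source, target):
    """A* search using landmark-based lower bounds as the heuristic."""
    def h(v):
        return max((abs(d[target] - d[v]) for d in landmark_dist), default=0.0)
    g = {source: 0.0}
    heap = [(h(source), 0.0, source)]
    while heap:
        _, gu, u = heapq.heappop(heap)
        if gu > g[u]:
            continue  # a shorter path to u was found after this entry was pushed
        if u == target:
            return gu
        for v, w in graph[u].items():
            ng = gu + w
            if ng < g.get(v, float("inf")):
                g[v] = ng
                heapq.heappush(heap, (ng + h(v), ng, v))
    return float("inf")

net = {
    "a": {"b": 2, "c": 5},
    "b": {"a": 2, "c": 1, "d": 4},
    "c": {"a": 5, "b": 1, "e": 3},
    "d": {"b": 4, "e": 1, "f": 6},
    "e": {"c": 3, "d": 1, "f": 2},
    "f": {"d": 6, "e": 2},
}
# Preprocessing phase: one Dijkstra run per landmark.
landmarks = [dijkstra(net, "a"), dijkstra(net, "f")]
a_to_f = alt_query(net, landmarks, "a", "f")
```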
Reach (RE)
Transit nodes (TN) 
• Divide graph G into subgraphs G_i
• Find R (subset of G_i) for each G_i
• All shortest paths out of G_i pass through R
• Build pairs (v_i, r_k) for each v_i, where r_k is the closest transit node
• Calculate shortest paths between the transit nodes in R
• Save them!
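Once these tables are saved, a long-distance query reduces to a few lookups. The sketch below assumes two precomputed structures with hypothetical names: `access`, mapping each vertex to its nearby transit nodes with distances, and `table`, holding distances between transit nodes. The numbers are made up for illustration. It covers only non-local queries; nearby source and target pairs would still need an ordinary search.

```python
def transit_node_distance(access, table, source, target):
    """Combine source -> transit node -> transit node -> target distances."""
    return min(
        ds + table[a, b] + dt
        for a, ds in access[source].items()
        for b, dt in access[target].items()
    )

# Hypothetical preprocessing output for two far-apart vertices s and t.
access = {
    "s": {"r1": 2.0, "r2": 4.0},
    "t": {"r3": 3.0},
}
table = {
    ("r1", "r3"): 10.0,
    ("r2", "r3"): 5.0,
}
s_to_t = transit_node_distance(access, table, "s", "t")
```

The query cost depends only on the number of transit nodes stored per vertex, not on the size of the graph.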
TN + ALT
Special Cases 
Optimization problems 
• Unstable graph 
• Preprocessing phase is meaningless
• How to invest $1B in the road network to minimize the time people spend in traffic jams
• How to invest $1M in the road network to improve its reliability before a flood
Last steps ... 
• I/O Efficient Algorithms and Data Structures
• Graphs and Memory Errors
Omsk
Novosibirsk
Novosibirsk, TN preprocessing
twitter + G+ + VK

Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 

Dernier (20)

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 

Thorny path to the Large-Scale Graph Processing (Highload++, 2014)

  • 1. Thorny path to the Large-Scale Graph Processing Zinoviev Alexey
  • 2. About • I am a <graph theory, machine learning, traffic jam prediction, BigData algorithms> scientist • But I'm a <Java, JavaScript, Android, NoSQL, Hadoop, Spark> programmer
  • 3. BigData & Graph Theory 3/65
  • 4. Big Data of old times • Astronomy • Weather • Trading • Sea routes • Battles
  • 5. And now ... • Web graph • Facebook friend network • Gmail email graph • EU road network • Citation graph • PayPal transaction graph
  • 6. Graph sizes (number of vertices, number of edges, volume, data per day) • Web graph: 1.5 * 10^12 vertices, 1.2 * 10^13 edges, 100 PB, 300 TB per day • Facebook (friends graph): 1.1 * 10^9 vertices, 160 * 10^9 edges, 1 PB, 15 TB per day • Road graph of EU: 18 * 10^6 vertices, 42 * 10^6 edges, 20 GB, 50 MB per day • Road graph of this city: 250 000 vertices, 460 000 edges, 500 MB, 100 KB per day
  • 7. Problems • Popularity rank (page rank) • Determining popular users, news, jobs, etc. • Shortest paths • Max flow • How are users, groups connected? • Clustering, semi-clustering • Max clique, triangle closure, label propagation algorithms • Finding related people, groups, interests
  • 12. Node Centrality Problem • Vertices with high impact • Removal of important vertices reduces the reliability • Cases: • Bioinformatics • Social connections • Road network • Spam detection • Recommendation system
  • 14. Small World Problem (average distance, vertices, edges) • Facebook: 4.74, 712 M, 69 G • Twitter: 3.67, ----, 5 G follows • MSN Messenger (1 month): 6.6, 180 M, 1.3 G arcs
  • 15. Large graph processing tools 15/65
  • 16. Think like a vertex… • The majority of graph algorithms are iterative and traverse the graph in some way • Classic MapReduce overheads (job startup/shutdown, reloading data from HDFS, shuffling) • Reducing a graph problem to the key-value model is highly complex • Iterative algorithms become multiple chained M/R jobs, with the full state saved and reread between jobs
  • 17. Why not use MapReduce/Hadoop? • Example: PageRank, Google's famous algorithm for measuring the authority of a webpage based on the underlying network of hyperlinks • Defined recursively: each vertex distributes its authority to its neighbors in equal proportions
  • 18. Google Pregel • Distributed system especially developed for large scale graph processing • Bulk Synchronous Parallel (BSP) as execution model • Supersteps are atomic units of parallel computation • Any superstep can be restarted from a checkpoint (need not be user defined) • A new superstep provides an opportunity for rebalancing of components among available resources
  • 20. Vertex-centric BSP • Each vertex has an id, a value, a list of its adjacent vertex ids and the corresponding edge values • Each vertex is invoked in each superstep, can recompute its value and send messages to other vertices, which are delivered over superstep barriers • Advanced features : termination votes, combiners, aggregators, topology mutations
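The vertex-centric model on slides 17 to 20 can be illustrated with a small single-process sketch. This is not Pregel or Giraph code: the function name `pregel_pagerank`, the damping factor 0.85 and the dictionary-based message boxes are assumptions made for illustration, and every vertex is assumed to have at least one outgoing edge. It shows each vertex recomputing its value from incoming messages and sending an equal share of its authority to its neighbors, with messages delivered only at the superstep barrier.

```python
from collections import defaultdict


def pregel_pagerank(out_edges, supersteps=30, damping=0.85):
    """Run PageRank as a sequence of BSP supersteps (illustrative sketch).

    out_edges maps each vertex id to the list of vertex ids it links to.
    Assumption: every vertex has at least one outgoing edge.
    """
    n = len(out_edges)
    value = {v: 1.0 / n for v in out_edges}   # per-vertex state
    inbox = {v: [] for v in out_edges}        # messages visible this superstep

    for step in range(supersteps + 1):
        outbox = defaultdict(list)
        # The body of this loop plays the role of compute() for one vertex.
        for v, targets in out_edges.items():
            if step > 0:
                value[v] = (1.0 - damping) / n + damping * sum(inbox[v])
            if step < supersteps and targets:
                share = value[v] / len(targets)   # equal share per neighbor
                for t in targets:
                    outbox[t].append(share)
        # Superstep barrier: messages become visible in the next superstep.
        inbox = {v: outbox.get(v, []) for v in out_edges}
    return value


ranks = pregel_pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In a real BSP system the inner loop runs in parallel across workers and the barrier is a global synchronization point; here both are simulated sequentially.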
  • 24. Why Apache Giraph • Pregel is proprietary, but: • Apache Giraph is an open source implementation of Pregel • Runs on standard Hadoop infrastructure • Computation is executed in memory • Can be a job in a pipeline (MapReduce, Hive) • Uses Apache ZooKeeper for synchronization
  • 26. Why Apache Giraph • No locks: message-based communication • No semaphores: global synchronization • Iteration isolation: massively parallelizable
  • 29. ZooKeeper in Apache Giraph ZooKeeper: responsible for computation state • Partition/worker mapping • Global state: superstep • Checkpoint paths, aggregator values, statistics
  • 30. Master in Apache Giraph Master: responsible for coordination • Assigns partitions to workers • Coordinates synchronization • Requests checkpoints • Aggregates aggregator values • Collects health statuses
  • 31. Worker in Apache Giraph Worker: responsible for vertices • Invokes the compute() function of active vertices • Sends, receives and assigns messages • Computes local aggregation values
  • 32. Scaling Giraph to a trillion edges
  • 33. Fault tolerance • No single point of failure from Giraph threads • With multiple master threads, if the current master dies, a new one automatically takes over • If a worker thread dies, the application is rolled back to a previously checkpointed superstep • If a ZooKeeper server dies, the application can proceed as long as a quorum remains • Hadoop single points of failure still exist (NameNode, JobTracker)
  • 37. MapReduce vs Giraph • Setup: 6 machines with 2x8-core Opteron CPUs, 4x1 TB disks and 32 GB RAM each, running 1 Giraph worker per core • Data: Wikipedia page link graph (6 million vertices, 200 million edges) • PageRank on Hadoop/Mahout: 10 iterations took approx. 29 minutes, average time per iteration approx. 3 minutes • PageRank on Giraph: 30 iterations took approx. 15 minutes, average time per iteration approx. 30 seconds • 10x performance improvement
  • 38. Okapi • Apache Mahout for graphs • Graph-based recommenders: ALS, SGD, SVD++, etc. • Graph analytics: Graph partitioning, Community Detection, K-Core, etc.
  • 40. Spark • MapReduce in memory • Up to 50x faster than Hadoop • Support for Shark (like Hive), MLlib (Machine learning), GraphX (graph processing) • RDD is a basic building block (immutable distributed collections of objects)
  • 41. Spark in Hadoop old family
  • 42. GraphX Supported algorithms • PageRank • Connected components • Label propagation • SVD++ • Strongly connected components • Triangle count
  • 43. GraphChi • Asynchronous disk-based version of GraphLab • Uses parallel sliding windows • Very small number of non-sequential accesses to the disk • For graphs that do not fit in memory • Input graph is split into P disjoint intervals to balance edges, each associated with a shard • For Home deals ...
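The interval splitting mentioned on the GraphChi slide can be sketched as follows. This is a simplified illustration, not GraphChi's own code: the helper names `split_into_intervals` and `build_shards`, the greedy balancing rule and the integer vertex ids 0..n-1 are assumptions. Each shard collects the edges whose destination falls into its interval and keeps them sorted by source, which is what makes mostly sequential disk reads possible.

```python
def split_into_intervals(in_degree, num_shards):
    """Split vertex ids 0..n-1 into contiguous intervals with balanced in-edges.

    in_degree[v] is the number of edges whose destination is vertex v.
    Returns at most num_shards (first, last) pairs, both ends inclusive.
    """
    target = sum(in_degree) / num_shards      # ideal number of edges per shard
    intervals, start, acc = [], 0, 0
    for v, deg in enumerate(in_degree):
        acc += deg
        remaining = num_shards - len(intervals)
        # Close the interval once it is full, but keep one shard for the tail.
        if acc >= target and remaining > 1:
            intervals.append((start, v))
            start, acc = v + 1, 0
    if start < len(in_degree):
        intervals.append((start, len(in_degree) - 1))
    return intervals


def build_shards(edges, intervals):
    """Group (src, dst) edges by the interval of dst, sorted by src."""
    shards = [[] for _ in intervals]
    for src, dst in edges:
        for i, (lo, hi) in enumerate(intervals):
            if lo <= dst <= hi:
                shards[i].append((src, dst))
                break
    for shard in shards:
        shard.sort()
    return shards


edges = [(0, 1), (2, 1), (1, 0), (3, 2), (0, 3), (2, 3)]
intervals = split_into_intervals([1, 2, 1, 2], num_shards=2)
shards = build_shards(edges, intervals)
```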
  • 47. Definition • Edge weights > 0 • A few classes of roads • Lat/Lon attributes for each vertex • Subgraphs for cross-roads • Not so big as web graph • Static
  • 49. AI
  • 50. Full
  • 53. We need a fast system! • Response < 10 ms (with high accuracy) • Shortest path (SP) with O(n) • Preprocessing phase • Don't keep all shortest paths: that takes O(n^2) • Use geo attributes • Use compression and recoding for disk storage • Network is stable
  • 54. EU road network, by algorithm • Dijkstra: 2 008 300 • ALT: 24 656 • RE: 2444 • HH: 462.0 • CH: 94.0 • TN: 1.8 • HL: 0.3 • References: • ALT: [Goldberg & Harrelson 05], [Delling & Wagner 07] • RE: [Gutman 05], [Goldberg et al. 07] • HH: [Sanders & Schultes 06] • CH: [Geisberger et al. 08] • TN: [Geisberger et al. 08] • HL: [Abraham et al. 11]
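For orientation, the Dijkstra baseline that the other methods on this slide are compared against can be sketched in a few lines. This is a textbook version, not the implementation that was benchmarked: the adjacency format `{u: [(v, weight), ...]}` and the function name `dijkstra` are assumptions, and edge weights are taken to be positive, as slide 47 states for the road graph.

```python
import heapq


def dijkstra(graph, source, target):
    """Return the shortest distance from source to target, or inf if unreachable.

    graph maps a vertex to a list of (neighbor, weight) pairs, weights > 0.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d                      # first pop of target is optimal
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry, skip it
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")


roads = {"A": [("B", 1.0), ("C", 5.0)], "B": [("C", 2.0)], "C": []}
shortest = dijkstra(roads, "A", "C")
```

Without preprocessing, every query may touch a large part of the graph, which is the cost the preprocessing-based methods listed on the slide are designed to avoid.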
  • 57. Transit nodes (TN) • Divide graph G into subgraphs G_i • Find R (subset of G_i) for each G_i • All shortest paths leaving G_i go through R • Build pairs (v_i, r_k) for each v_i, where r_k is the closest transit node • Calculate the shortest paths between the transit nodes in R • Save it!
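Once the preprocessing steps above are done, a long-distance query reduces to table lookups. The sketch below is a hedged illustration under several assumptions: the names `transit_node_distance`, `access` and `transit_table` are hypothetical, the graph is treated as undirected, each vertex may store several nearby transit nodes rather than only the closest one, and source and target are assumed to be far enough apart that the shortest path really passes through transit nodes (nearby pairs would need a local search instead).

```python
def transit_node_distance(s, t, access, transit_table):
    """Estimate dist(s, t) from precomputed transit-node tables.

    access[v] maps each transit node near v to the distance between them.
    transit_table[(a, b)] is the precomputed distance between transit nodes.
    """
    best = float("inf")
    for a, d_sa in access[s].items():
        for b, d_bt in access[t].items():
            d_ab = 0.0 if a == b else transit_table.get((a, b), float("inf"))
            # Candidate route: s -> a -> b -> t, all parts looked up.
            best = min(best, d_sa + d_ab + d_bt)
    return best


access = {"s": {"r1": 2.0, "r2": 5.0}, "t": {"r3": 1.0}}
transit_table = {("r1", "r3"): 10.0, ("r2", "r3"): 4.0}
distance = transit_node_distance("s", "t", access, transit_table)
```

The query cost depends only on the number of stored transit nodes per vertex, not on the size of the graph.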
  • 60. Optimization problems • Unstable graph • Preprocessing phase is meaningless • How to invest $1B in the road network to minimize the time people spend in traffic jams • How to invest $1M in the road network to improve reliability before flooding
  • 61. Last steps ... • I/O Efficient Algorithms and Data Structures • Graphs and Memory Errors
  • 62. Omsk
  • 65. twitter + G+ + VK