Jose Quesada
Director, Data Science Retreat
jose@datascienceretreat.com
@quesada
• Mentors are world-class: CTOs, library authors, inventors,
founders of fast-growing companies, etc.
• DSR accepts fewer than 5% of applications
• Strong focus on commercial awareness
• 5 years of work experience on average
• 30+ partner companies in Europe
DSR participants do a portfolio project
Why is DSR talking about Scala/Spark?
They are b
IBM is behind this
They hired
What is a good question?
• Business case
• Data available
• Technology to answer the question is available
• We know when the solution worked
Does he look like a bitch?
The question: When should I tweet
to influence the right account?

Or ‘beat Buffer at their own game’
What is a good question?
• Business case
DJ J & MAX RECORDS
Overlap Tweet hours
Tweet frequency per UTC hour
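A minimal sketch of how the two plots above could be computed, assuming tweet timestamps are available as timezone-aware datetimes. The sample data, the function names, and the choice of histogram intersection as the overlap measure are illustrative assumptions, not taken from the talk.

```python
from collections import Counter
from datetime import datetime, timezone


def hourly_frequency(timestamps):
    """Return a 24-element list: share of tweets posted in each UTC hour."""
    counts = Counter(ts.astimezone(timezone.utc).hour for ts in timestamps)
    total = sum(counts.values())
    return [counts.get(h, 0) / total if total else 0.0 for h in range(24)]


def hour_overlap(freq_a, freq_b):
    """Histogram intersection: 1.0 for identical profiles, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(freq_a, freq_b))


# Hypothetical data: my tweets and the tweets of the account I want to reach.
mine = [datetime(2016, 5, 1, h, 0, tzinfo=timezone.utc) for h in (8, 9, 9, 17)]
target = [datetime(2016, 5, 1, h, 0, tzinfo=timezone.utc) for h in (9, 17, 17, 22)]

mine_freq = hourly_frequency(mine)
target_freq = hourly_frequency(target)
overlap = hour_overlap(mine_freq, target_freq)

# The hour in which the target account is most active.
best_hour = max(range(24), key=lambda h: target_freq[h])
```

A low overlap would suggest shifting tweets towards the hours in which the target account is active.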
What is a good question?
• Business case
• Data available
24GB
What is a good question?
• Business case
• Data available
• Technology to answer the question is available
What is a good question?
• Business case
• Data available
• Technology to answer the question is available
• We know when the solution worked
Graph theory parts we can use to solve this problem
Graph theory primer
• Random walk
• Shortest path
• Sampling
Sampling in Networks
Note that sampling in networks is fraught with difficulties. One cannot simply
sample the edges and nodes and expect the sample to be representative of the
original network. In the graph below, a sample that missed node 1 or 2 would
disconnect the two clusters and would not have the same properties as the
original.
Node 11
Node 2
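The sampling pitfall can be reproduced on a toy graph. This is a sketch under stated assumptions: the graph below is hypothetical (two triangles joined by a single edge), not the one in the slide, and the helper name is made up for illustration.

```python
from collections import deque


def connected_components(nodes, edges):
    """Count connected components of an undirected graph using BFS."""
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        # Edges touching a node that was not sampled are dropped.
        if u in adjacency and v in adjacency:
            adjacency[u].add(v)
            adjacency[v].add(u)
    seen, components = set(), 0
    for start in adjacency:
        if start in seen:
            continue
        components += 1
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return components


# Hypothetical toy graph: two triangles joined only through the edge b1-b2.
edges = [("a1", "a2"), ("a2", "b1"), ("a1", "b1"),
         ("b1", "b2"),
         ("b2", "c1"), ("c1", "c2"), ("b2", "c2")]
nodes = {n for edge in edges for n in edge}

full = connected_components(nodes, edges)
sample_missing_bridge = connected_components(nodes - {"b1"}, edges)
```

The full graph is one component, while the sample that misses the bridge node `b1` falls apart into two.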
Random surfer
[Figure: a random walk over nodes A, B, C, D, E]
Visited more often:
• Nodes with many links
• Coming from frequently visited nodes
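The two bullets above can be checked with a small simulation. This is a sketch, assuming a hypothetical five-node graph in which every node has at least one out-link; the link structure is invented for illustration and is not the one drawn in the slides.

```python
import random
from collections import Counter


def random_surfer(out_links, steps, seed=0):
    """Follow a uniformly random out-link at every step and count visits."""
    rng = random.Random(seed)
    node = sorted(out_links)[0]  # start at the first node alphabetically
    visits = Counter()
    for _ in range(steps):
        visits[node] += 1
        node = rng.choice(out_links[node])
    return visits


# Hypothetical graph: A is linked from every other node,
# and E is linked from both A and D.
out_links = {
    "A": ["B", "C", "D", "E"],
    "B": ["A"],
    "C": ["A"],
    "D": ["A", "E"],
    "E": ["A"],
}
visits = random_surfer(out_links, steps=10_000)
most_visited = max(visits, key=visits.get)
```

In this graph A collects the most visits because everything links to it, and E is visited more often than B because it has an extra incoming link.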
Computing PageRank
[Figure: step-by-step computation over nodes A, B, C, D, E]
Teleport
[Figure: nodes A to E, with edges labeled (1-α) and α]
At a regular node: invoke the teleport operation with
probability α and a standard random-walk step with
probability (1-α)
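The teleport rule can be turned into a power iteration. This is a minimal sketch, not the talk's implementation: α is the teleport probability, as on the slide (many references quote the walk probability, the damping factor, instead). The graph, the default α of 0.15, and the handling of dead ends are assumptions.

```python
def pagerank(out_links, alpha=0.15, iterations=100):
    """Power iteration; alpha is the teleport probability, as in the slide."""
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Teleport step: every node receives an equal share of alpha.
        new_rank = {node: alpha / n for node in nodes}
        for node in nodes:
            # Assumption: a node without out-links jumps to any node.
            targets = out_links[node] or nodes
            # Random-walk step: split (1 - alpha) of the rank over the out-links.
            share = (1.0 - alpha) * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical five-node graph, invented for illustration.
out_links = {
    "A": ["B", "C", "D", "E"],
    "B": ["A"],
    "C": ["A"],
    "D": ["A", "E"],
    "E": ["A"],
}
rank = pagerank(out_links)
always_teleport = pagerank(out_links, alpha=1.0)
```

With `alpha=1.0` the surfer teleports at every step, so all nodes end up with the same rank; smaller values let the link structure dominate.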
Personalized PageRank
[Figure: nodes A to E, with edges labeled (1-α) and α]
At a regular node: invoke the teleport operation with
probability α and a standard random-walk step with
probability (1-α). When teleporting, go to the target
node.
• A special case of PageRank with priors (a distribution of weights
over the nodes)
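A sketch of PageRank with priors, where the teleport step lands on nodes according to a prior distribution; putting all the prior weight on one node gives the personalized variant. The function name, the graph, and the dead-end handling are assumptions made for illustration.

```python
def pagerank_with_priors(out_links, prior, alpha=0.15, iterations=100):
    """Power iteration where teleporting follows `prior` (weights summing to 1)."""
    nodes = sorted(out_links)
    rank = {node: prior.get(node, 0.0) for node in nodes}
    for _ in range(iterations):
        # Teleport step: alpha of the mass is redistributed according to the prior.
        new_rank = {node: alpha * prior.get(node, 0.0) for node in nodes}
        for node in nodes:
            walk_mass = (1.0 - alpha) * rank[node]
            targets = out_links[node]
            if targets:
                for target in targets:
                    new_rank[target] += walk_mass / len(targets)
            else:
                # Assumption: a dead end hands its mass back through the prior.
                for target in nodes:
                    new_rank[target] += walk_mass * prior.get(target, 0.0)
        rank = new_rank
    return rank


# Hypothetical five-node graph, invented for illustration.
out_links = {
    "A": ["B", "C", "D", "E"],
    "B": ["A"],
    "C": ["A"],
    "D": ["A", "E"],
    "E": ["A"],
}
uniform = pagerank_with_priors(out_links, {n: 0.2 for n in out_links})
personal = pagerank_with_priors(out_links, {"D": 1.0})  # always teleport to D
```

Teleporting only to D raises D's score relative to the uniform prior, which is what makes the ranking specific to a target node.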
Implementation
A partitioned, distributed graph processing engine is significantly more complex and difficult to build
GraphX and GraphFrames (new in Spark 2.0)
• GraphX is to RDD as GraphFrame is to DataFrame
• GraphX is lower level, and the API is Scala-only. GraphFrames is
very new:
• It’s not designed to be a graph database like Neo4j. Nodes and
edges can contain metadata, but the query engine is not as
complete as Cypher
Advantages of GraphFrames
• GraphFrames have a Python API
• GraphFrames give you simple querying for free. GraphFrame
vertices and edges are stored as DataFrames, so many queries are
just DataFrame (or SQL) queries
• They contain most of the algorithms in GraphX, but the API is
less well-tested
• You work in the PySpark shell instead of spark-shell
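A short sketch of what the Python API looks like, assuming a running SparkSession with the separately installed graphframes package available. The vertex and edge data and the helper names are hypothetical; the column names `id`, `src`, and `dst` are the ones GraphFrames expects.

```python
def build_follow_graph(spark):
    """Build a tiny GraphFrame; requires pyspark and the graphframes package."""
    from graphframes import GraphFrame  # imported lazily: optional dependency

    # Vertices need an "id" column; edges need "src" and "dst" columns.
    vertices = spark.createDataFrame(
        [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
    edges = spark.createDataFrame(
        [("a", "b", "follows"), ("b", "c", "follows"), ("c", "b", "follows")],
        ["src", "dst", "relationship"])
    return GraphFrame(vertices, edges)


def most_followed(graph):
    """In-degrees come back as a DataFrame, so sorting is a plain DataFrame query."""
    return graph.inDegrees.orderBy("inDegree", ascending=False)


def ranked(graph, source_id=None):
    """Run PageRank; passing source_id asks for the personalized variant."""
    if source_id is None:
        result = graph.pageRank(resetProbability=0.15, maxIter=10)
    else:
        result = graph.pageRank(resetProbability=0.15, maxIter=10,
                                sourceId=source_id)
    return result.vertices.select("id", "pagerank")
```

In a PySpark shell where `spark` is already defined, `ranked(build_follow_graph(spark)).show()` would print one PageRank score per vertex.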
Distributed PageRank
• Problem: computing PageRank on a graph too large for one
machine
• Algorithm:
– Shard edges randomly
– Compute on each machine
– Average results
• Basic idea: duplicate edges from low-degree nodes. This gives an
unbiased estimator
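The three algorithm steps can be sketched on one machine, with a loop standing in for the cluster. This is a naive illustration under stated assumptions: it leaves out the duplication of edges from low-degree nodes, so it should not be read as the unbiased estimator mentioned above. The graph and function names are hypothetical.

```python
import random


def pagerank(nodes, edges, alpha=0.15, iterations=50):
    """Single-machine PageRank by power iteration; alpha is the teleport probability."""
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: alpha / len(nodes) for node in nodes}
        for node in nodes:
            # A node with no out-links in this shard spreads its rank evenly.
            targets = out_links[node] or nodes
            share = (1.0 - alpha) * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def sharded_pagerank(nodes, edges, shards=2, seed=0):
    """Shard edges randomly, rank each shard separately, then average."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(shards)]
    for edge in edges:
        buckets[rng.randrange(shards)].append(edge)
    # On a cluster, each bucket would be processed by a different machine.
    partial = [pagerank(nodes, bucket) for bucket in buckets]
    return {node: sum(p[node] for p in partial) / shards for node in nodes}


# Hypothetical five-node graph, invented for illustration.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
         ("B", "A"), ("C", "A"), ("D", "A"), ("D", "E"), ("E", "A")]

estimate = sharded_pagerank(nodes, edges, shards=2)
exact = pagerank(nodes, edges)
```

Comparing `estimate` with `exact` on a small graph gives a feel for how much the random sharding distorts the scores.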
• Nodes: 41,652,230
• Edges: 1,468,365,182
Summary of implementation, benefits
• Graph theory is a really flexible way to represent a problem
• Data structures to represent graphs are mature
• You can now do out-of-core, distributed graph analysis
cheaply
• Implementations exist even for state-of-the-art methods
Summary, finding a problem
• We live in an age of abundance (methods, data, hardware, ideas)
• Finding the question is more than half of the battle
• I had about a week to prepare this talk, but I managed to put
together something that showcases what you can do with large
graphs today, and it could be effective as a startup idea
• My question is not great because you cannot demonstrate that it
works until you use it (a common problem for unsupervised methods)
The question: When should I tweet
to influence the right account?

Or ‘beat Buffer at their own game’
References: Drawing graphs
• Graphs in this slide set have been drawn with Gephi
• If you use a Zeppelin notebook, you can draw graphs with:
drawGraph(org.apache.spark.graphx.util.GraphGenerators.rmatGraph(sc, 32, 60))
25 videos explaining ML on Spark, 50 more
to come. A bunch on GraphX
• For people who already know ML
• http://datascienceretreat.com/videos/data-science-with-scala-and-spark
About learning new tech over seven
weekends
• You have time and enjoy using it to learn alone: learn it ‘the
hard way’
• You are extremely motivated and talented, and have money: apply
for DSR
• You want your weekends for yourself, and you are already very
good but want to switch jobs: apply for codekitt
Thanks!
Jose Quesada
Director, Data Science Retreat
jose@datascienceretreat.com
@quesada
http://datascienceretreat.com/
codekitt.com

Distributed processing of large graphs in python
