Link Analysis:
Finding important nodes in large-scale networks
Yusuke Yamamoto
Lecturer, Faculty of Informatics
yusuke_yamamoto@acm.org
Data Engineering (Recommender Systems 4)
2019.11.18
Graph data
2
A graph is a data structure consisting of a collection of nodes and edges (links).
Each edge represents a relation between two nodes.
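As a minimal sketch of this definition, a directed graph can be stored as an adjacency list that maps each node to the nodes it links to. The node names and edges below are hypothetical and only illustrate the structure.

```python
# Directed graph stored as an adjacency list: node -> nodes it links to.
# The edges are hypothetical, chosen only to illustrate the data structure.
graph = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "C"],
}

# The node set, and the edge set as (source, destination) pairs.
nodes = sorted(graph)
edges = [(src, dst) for src, targets in graph.items() for dst in targets]

print(nodes)
print(len(edges), "directed edges")
```

An adjacency list keeps only the links that exist, which matters for large, sparse graphs such as the Web.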
Graph data is often observed in real life
3
Image from William L. Hamilton's COMP551 special topic lecture
[Figure: example graphs, including paper citation networks and the Web]
Important nodes in graphs
4
Image from William L. Hamilton's COMP551 special topic lecture
We often want to know which nodes are important in a graph.
Who is the most influential person?
Paper citation networks: Which is the best paper?
Web: Which is the most popular webpage?
Q. How can we compute the importance of nodes in a graph?
A. Link analysis can help!
What do we learn today?
6
1. PageRank
2. Topic-sensitive PageRank
7
1. PageRank
Google introduced a new method to evaluate webpages
The objective of PageRank
8
Based on the graph structure, PageRank evaluates and ranks webpages.
[Figure: web graph (hyperlink structure) with nodes A, B, C, D, and E]
Importance ranking
1. node B: 0.40 pt
2. node D: 0.26 pt
3. node A: 0.20 pt
4. node C: 0.11 pt
5. node E: 0.03 pt
Simple method to evaluate webpage importance
9
Simple assumption (majority voting)
If many webpages link to a webpage, that webpage is likely to be important.
[Figure: web graph with nodes A, B, C, D, and E; B has 3 in-links, C and D have 2 in-links each]
Is this assumption good enough?
Problems on simple link counting (1/2)
11
Malicious websites can easily inflate their scores by creating a 'spam farm' of, for example, a million pages that link to them.
[Figure: web graph with nodes A, B, C, D, and E; a spam farm of 98 pages (M) links to one node, raising its in-link count from 2 to 100]
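The weakness of simple link counting can be sketched in a few lines. The edge list below is an assumption, chosen only to match the in-link counts shown on the slides (B: 3, C: 2, D: 2); the boosted page `target` and the spam page names `M0` to `M97` are likewise hypothetical.

```python
from collections import Counter

def count_in_links(edges):
    """Return the number of incoming links for every node that appears in edges."""
    counts = Counter()
    for src, dst in edges:
        counts[src] += 0  # list source nodes too, even if nothing links to them
        counts[dst] += 1
    return counts

# Hypothetical edge list matching the in-link counts on the slides.
edges = [
    ("A", "B"), ("C", "B"), ("D", "B"),
    ("A", "C"), ("E", "C"),
    ("A", "D"), ("B", "D"),
    ("B", "A"),
]
before = count_in_links(edges)

# A spam farm of 98 pages that all link to the page wanting a higher score.
target = "C"  # hypothetical choice of the boosted page
spam_edges = [(f"M{i}", target) for i in range(98)]
after = count_in_links(edges + spam_edges)

print(before["B"], before["C"], before["D"])
print(after[target])
```

Creating the 98 extra pages costs the spammer almost nothing, yet the count for the target jumps from 2 to 100.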
Problems on simple link counting (2/2)
13
The simple method does not consider whether a webpage is linked from important pages or from unimportant pages.
[Figure: web graph with nodes A, B, C, D, and E; B has 3 in-links, C and D have 2 in-links each; D is linked from B (3 in-links), while C is linked from E (0 in-links)]
Which is more important, node C or D?
Basic idea of PageRank
Assumption
If a page is linked from many IMPORTANT pages, the page is likely to be important.
[Figure: web graph with nodes A, B, C, D, and E; C and D have 2 in-links each, but D is linked from B, which is more important than E]
D is more important than C because D is linked from a more important node (B) than C is.
Another interpretation of basic idea of PageRank
15
People are more likely to visit more important pages.
1. While people are browsing a page, we assume that they randomly select one of its links to decide which page to browse next.
2. Following links, people are more likely to arrive at a page from more important pages than from less important ones.
3. How can we calculate the likelihood of a visit?
[Figure: web graph with nodes A, B, C, D, and E; one node is marked as having the highest chance of being visited]
Toy example to check the basic idea of PageRank
16
[Figure: graph with nodes A, B, C, and D]
Q. Suppose that a random surfer is now at A. On each page, he randomly selects one of its links to decide which page to visit next. Which page is he most likely to (re-)visit?
Initial probabilities: A = 1, B = 0, C = 0, D = 0
Toy example to check the basic idea of PageRank
17
The surfer randomly selects a link to follow. From A, the transition probability to each of B, C, and D is 1/3.
What are the chances that he will be on node B, C, or D after his first transition?
B: 1 × (1/3) = 1/3
C: 1 × (1/3) = 1/3
D: 1 × (1/3) = 1/3
A: 0
To which node will he move next?
Toy example to check the basic idea of PageRank
21
Transition probabilities out of B, C, and D: four links have probability 1/2 and one link has probability 1.
What are the chances that he will be on each node after two transitions?
Probabilities after the first transition: A = 0, B = 1/3, C = 1/3, D = 1/3.
A: (1/3 × 1/2) + (1/3 × 1) = 1/2
Each of the other three nodes (B, C, D) ends up with 1/6:
(1/3 × 1/2) + (1/3 × 0) = 1/6
(1/3 × 1/2) + (0 × 1/3) = 1/6
(0 × 1/3) + (1/3 × 1/2) = 1/6
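The two transitions can be reproduced exactly with fractions. The link structure below is an assumption inferred from the transition probabilities on the slides (A links to B, C, D; B links to A and D; C links to A; D links to B and C); the helper name `step` is illustrative.

```python
from fractions import Fraction

# Assumed out-links of the toy graph, inferred from the slide's probabilities.
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}

def step(prob):
    """Move the surfer one link ahead: each page splits its probability evenly."""
    nxt = {node: Fraction(0) for node in links}
    for node, p in prob.items():
        share = p / len(links[node])
        for target in links[node]:
            nxt[target] += share
    return nxt

start = {"A": Fraction(1), "B": Fraction(0), "C": Fraction(0), "D": Fraction(0)}
after_one = step(start)      # probabilities after the first transition
after_two = step(after_one)  # probabilities after the second transition

print({node: str(p) for node, p in after_one.items()})
print({node: str(p) for node, p in after_two.items()})
```

Using `Fraction` instead of floats keeps the arithmetic identical to the hand calculation, so the results can be compared term by term.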
Toy example to check the basic idea of PageRank
25
Q. Which page is he most likely to (re-)visit?
Probability change in each iteration
Node \ Iter.   0      1      2      3      4      5
A              1      0      0.5    0.25   0.375  0.313
B              0      0.333  0.167  0.25   0.208  0.229
C              0      0.333  0.167  0.25   0.208  0.229
D              0      0.333  0.167  0.25   0.208  0.229
Toy example to check the basic idea of PageRank
26
Q. Which page is he most likely to (re-)visit?
Probability change in each iteration
Node \ Iter.   0      5      10     20     …   1000
A              1      0.313  0.334  0.333  …   0.333
B              0      0.229  0.222  0.222  …   0.222
C              0      0.229  0.222  0.222  …   0.222
D              0      0.229  0.222  0.222  …   0.222
As the transitions repeat, each probability converges.
The converged probabilities represent the likelihood that people visit each page (i.e., PageRank).
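The convergence in the table can be checked with a short simulation. This is a sketch under the same assumed link structure as the toy example (A links to B, C, D; B links to A and D; C links to A; D links to B and C); `walk` and `history` are illustrative names.

```python
# Assumed out-links of the toy graph.
LINKS = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}

def walk(start, n_steps):
    """Return the visit probabilities after 0, 1, ..., n_steps transitions."""
    history = [dict(start)]
    for _ in range(n_steps):
        prev = history[-1]
        nxt = {node: 0.0 for node in LINKS}
        for node, targets in LINKS.items():
            for target in targets:
                # Each page passes an equal share to every page it links to.
                nxt[target] += prev[node] / len(targets)
        history.append(nxt)
    return history

history = walk({"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}, 50)
for n in (0, 1, 2, 3, 4, 5, 10, 20, 50):
    print(n, [f"{history[n][node]:.3f}" for node in "ABCD"])
```

For this graph the probabilities settle at 1/3 for A and 2/9 for each of B, C, and D, which is where the table's 0.333 and 0.222 come from.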
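Mathematical procedure to calculate simple PageRank (1/4)
27
Initial probability of being on each node (node order A, B, C, D):
$$\mathbf{r}_0 = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}$$
Transition probability from node to node, where entry $M_{ij}$ is the probability of moving from node $j$ to node $i$ (each column holds the out-link probabilities of one node):
$$M = \begin{pmatrix} 0 & 1/2 & 1 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix}$$
[Figure: graph with nodes A, B, C, and D; A links to B, C, and D with probability 1/3 each; B links to A and D with probability 1/2 each; C links to A with probability 1; D links to B and C with probability 1/2 each]
Mathematical procedure to calculate simple PageRank (2/4)
28
$$\mathbf{r}_1 = M\mathbf{r}_0 = \begin{pmatrix} 0 & 1/2 & 1 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}$$
Mathematical procedure to calculate simple PageRank (3/4)
29
$$\mathbf{r}_2 = M\mathbf{r}_1 = MM\mathbf{r}_0 = M^2\mathbf{r}_0$$
Mathematical procedure to calculate simple PageRank (4/4)
30
$$\mathbf{r}_n = M\mathbf{r}_{n-1} = MM\mathbf{r}_{n-2} = M^2\mathbf{r}_{n-2} = \dots = M^n\mathbf{r}_0$$
If n is large enough or r_n has converged, we take r_n to represent the likelihood that people visit each node.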
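The matrix form can be verified with exact arithmetic. The sketch below assumes node order A, B, C, D and a matrix whose entry `M[i][j]` is the probability of moving from node j to node i, with the link structure of the toy example (A links to B, C, D; B links to A and D; C links to A; D links to B and C). The helpers `mat_vec` and `mat_mul` are illustrative.

```python
from fractions import Fraction as F

# Column-stochastic transition matrix: M[i][j] = P(move from node j to node i).
M = [
    [F(0),    F(1, 2), F(1), F(0)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(1, 2), F(0), F(0)],
]
r0 = [F(1), F(0), F(0), F(0)]  # the surfer starts at A

def mat_vec(m, v):
    """Matrix-vector product m v."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def mat_mul(a, b):
    """Matrix-matrix product a b for square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

r1 = mat_vec(M, r0)          # r_1 = M r_0
r2 = mat_vec(M, r1)          # r_2 = M r_1
M2 = mat_mul(M, M)           # M^2
r2_direct = mat_vec(M2, r0)  # r_2 = M^2 r_0

print([str(x) for x in r1])
print([str(x) for x in r2])
```

Both routes to r_2 give the same vector, which is the identity r_2 = M r_1 = M^2 r_0 used in the derivation.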
Problems of simple PageRank (1/3)
31
Some link structures violate the PageRank assumption.
[Figure: two graphs with nodes A, B, C, and D, one containing a dead end and one containing a spider trap]
Problems of simple PageRank (2/3)
32
Some link structures violate the PageRank assumption.
[Figure: graph with nodes A, B, C, and D containing a dead end]
Probability change in each iteration
Node \ Iter.   0      1      10     100
A              1      0      0.01   0
B              0      0.333  0.015  0
C              0      0.333  0.015  0
D              0      0.333  0.015  0
Problems of simple PageRank (3/3)
33
Some link structures violate the PageRank assumption.
[Figure: graph with nodes A, B, C, and D containing a spider trap]
Probability change in each iteration
Node \ Iter.   0      1      10     100
A              1      0      0.01   0
B              0      0.333  0.015  0
C              0      0.333  0.961  1
D              0      0.333  0.015  0
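Both failure modes can be reproduced with the simple random walk. The link structures below are assumptions consistent with the tables: in the dead-end graph C has no out-links, and in the spider-trap graph C links only to itself. The function name `iterate` is illustrative.

```python
def iterate(links, start, n_steps):
    """Run the simple random walk; a page without out-links passes nothing on."""
    prob = dict(start)
    for _ in range(n_steps):
        nxt = {node: 0.0 for node in links}
        for node, targets in links.items():
            for target in targets:
                nxt[target] += prob[node] / len(targets)
        prob = nxt
    return prob

start = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}

# Assumed link structures for the two problem cases.
dead_end = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": [], "D": ["B", "C"]}
spider_trap = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["C"], "D": ["B", "C"]}

dead = iterate(dead_end, start, 100)
trap = iterate(spider_trap, start, 100)

print("dead end, total probability left:", sum(dead.values()))
print("spider trap, probability on C:", trap["C"])
```

With a dead end the probability leaks out of the graph until every score is 0; with a spider trap the probability is conserved but all of it ends up on C.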
Revision of PageRank assumption (Complete PageRank)
34
1. While people are browsing a page, we assume that they randomly select one of its links to decide which page to browse next.
2. Sometimes, people visit pages directly and at random, without using hyperlinks (called a random jump).
[Figure: two copies of a graph with nodes A, B, C, and D. In most cases, people use links; sometimes, people jump directly to a page]
Algorithm of complete PageRank (1/5)
35
1. Initialize r_0 (randomly assign values to r_0). Set the transition matrix M and the random surf vector d.
2. Starting from r_0, repeatedly update r_n with the formula below.
$$\mathbf{r}_n = \alpha M\mathbf{r}_{n-1} + (1-\alpha)\,\mathbf{d}$$
The first term, $\alpha M\mathbf{r}_{n-1}$, corresponds to the case where people use links to visit pages.
The second term, $(1-\alpha)\,\mathbf{d}$, corresponds to the case where people visit pages directly.
Algorithm of complete PageRank (2/5)
36
M is the transition matrix, which is derived from the link structure. Entry $M_{ij}$ is the probability of moving from node $j$ to node $i$ (node order A, B, C, D):
$$M = \begin{pmatrix} 0 & 1/2 & 0 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix}$$
[Figure: graph with nodes A, B, C, and D; A links to B, C, and D with probability 1/3 each; B links to A and D with probability 1/2 each; D links to B and C with probability 1/2 each; C has no out-links]
Algorithm of complete PageRank (3/5)
37
d is the random surf vector: the probability that people visit each page directly (here, a uniform distribution).
$$\mathbf{d} = \begin{pmatrix} 1/4 \\ 1/4 \\ 1/4 \\ 1/4 \end{pmatrix}$$
Algorithm of complete PageRank (4/5)
38
α is the parameter (a probability) that decides which of the two modes people use. Empirically, α is set in the range 0.8 to 0.9.
Algorithm of complete PageRank (5/5)
39
3. If r_n has converged (it no longer changes), the algorithm finishes. The converged r_n is the PageRank!
Simple PageRank vs. complete PageRank
40
[Figure: graph with nodes A, B, C, and D containing a spider trap]
Probability change in each iteration
Simple PageRank
Node \ Iter.   0      1      10     100
A              1      0      0.01   0
B              0      0.333  0.015  0
C              0      0.333  0.961  1
D              0      0.333  0.015  0
Complete PageRank
Node \ Iter.   0      1      10     100
A              1      0.05   0.102  0.101
B              0      0.316  0.129  0.128
C              0      0.316  0.639  0.642
D              0      0.316  0.129  0.128
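The update rule r_n = αM r_{n-1} + (1 − α)d and its stopping condition can be sketched as follows. Several things here are assumptions rather than facts from the slides: the spider-trap link structure (C links only to itself), the matrix orientation (`M[i][j]` is the probability of moving from node j to node i), the tolerance `tol`, and α = 0.8, a value chosen because it gives the first-iteration score (1 − α)/4 = 0.05 for A.

```python
def complete_pagerank(M, d, r0, alpha=0.8, tol=1e-10, max_iter=1000):
    """Iterate r_n = alpha * M r_{n-1} + (1 - alpha) * d until r stops changing."""
    n = len(d)
    r = list(r0)
    for _ in range(max_iter):
        nxt = [alpha * sum(M[i][j] * r[j] for j in range(n)) + (1 - alpha) * d[i]
               for i in range(n)]
        # Converged when the total change between two iterations is negligible.
        if sum(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r

# Assumed spider-trap graph, node order A, B, C, D.
M = [
    [0,     1 / 2, 0, 0],
    [1 / 3, 0,     0, 1 / 2],
    [1 / 3, 0,     1, 1 / 2],
    [1 / 3, 1 / 2, 0, 0],
]
d = [1 / 4] * 4            # uniform random surf vector
r0 = [1.0, 0.0, 0.0, 0.0]  # start at A, as in the tables

first = complete_pagerank(M, d, r0, alpha=0.8, max_iter=1)  # one update only
scores = complete_pagerank(M, d, r0, alpha=0.8)             # converged scores

print([round(x, 3) for x in first])
print([round(x, 3) for x in scores])
```

Under these assumptions the converged scores are close to the complete PageRank column of the table: C still ranks highest, but the random jump keeps A, B, and D from dropping to 0.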
What can PageRank provide us?
41
PageRank can evaluate the centrality of nodes in graph (network) data.
[Figure: PageRank applied to three kinds of networks to find influential people, good papers to cite (paper citation networks), and popular webpages (Web)]
42
2. Topic-sensitive PageRank
An improved PageRank that considers each node's topic
Issues of normal PageRank
43
Normal PageRank ignores what kinds of topics each node is related to.
[Figure: graph with nodes A, B, C, D, E, F, and G; each node is marked as a page about medicine or a page about cosmetics]
Normal PageRank
1. C: 0.282 pt
2. A: 0.174 pt
3. F: 0.133 pt
4. D: 0.132 pt
5. B: 0.093 pt
6. E: 0.092 pt
7. G: 0.092 pt
Is page C important from the viewpoint of medicine?
Which node is the most important about medicine?
44
[Figure: graph with nodes A, B, C, D, E, F, and G; each node is marked as a page about medicine or a page about cosmetics]
Many pages link to C, but only one of them is about medicine.
A is linked from more pages about medicine than C is.
Issues of normal PageRank
45
We sometimes want to find important pages (nodes) about a certain topic.
If people often move to a page from important pages about the topic, that page should be important for the topic!
Assumption of Topic-sensitive PageRank
46
Normal PageRank
● People follow links in pages to visit other pages.
● They sometimes randomly visit any kind of page without using links.
Topic-sensitive PageRank
● People follow links in pages to visit other pages.
● They sometimes randomly visit only one kind of page (pages about the target topic) without using links.
Algorithm of Topic-sensitive PageRank (1/2)
47
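Starting from r_0, repeatedly update r_n with the formula below.
$$\mathbf{r}_n = \alpha M\mathbf{r}_{n-1} + (1-\alpha)\,\mathbf{d}$$
Transition matrix M (node order A, B, C, D, E, F, G; entry $M_{ij}$ is the probability of moving from node $j$ to node $i$):
$$M = \begin{pmatrix} 0 & 1/2 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1/3 & 0 & 0 & 0 \\ 1/4 & 0 & 0 & 1/3 & 1/2 & 1 & 0 \\ 1/4 & 1/2 & 0 & 0 & 0 & 0 & 0 \\ 1/4 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1/3 & 1/2 & 0 & 0 \\ 1/4 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}$$
[Figure: graph with nodes A, B, C, D, E, F, and G; each link is labeled with its transition probability]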
Algorithm of Topic-sensitive PageRank (2/2)
48
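Starting from r_0, repeatedly update r_n with the formula below.
$$\mathbf{r}_n = \alpha M\mathbf{r}_{n-1} + (1-\alpha)\,\mathbf{d}$$
Random surf vector d (node order A, B, C, D, E, F, G):
Normal PageRank: $\mathbf{d} = (1/7,\ 1/7,\ 1/7,\ 1/7,\ 1/7,\ 1/7,\ 1/7)^T$
Topic-sensitive PageRank: $\mathbf{d} = (1/4,\ 1/4,\ 1/4,\ 0,\ 0,\ 0,\ 1/4)^T$
[Figure: graph with nodes A, B, C, D, E, F, and G]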
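The effect of swapping the random surf vector can be sketched on the seven-node graph. The out-links are read off the columns of the 7 × 7 transition matrix. Two choices are assumptions, since the slides do not state them: α = 0.85, and the treatment of the dead end C, whose probability is redistributed according to d. With these choices the sketch agrees with the scores on the results slide to within about 0.001.

```python
NODES = ["A", "B", "C", "D", "E", "F", "G"]

# Out-links read from the columns of the transition matrix; C is a dead end.
LINKS = {
    "A": ["C", "D", "E", "G"],
    "B": ["A", "D"],
    "C": [],
    "D": ["B", "C", "F"],
    "E": ["C", "F"],
    "F": ["C"],
    "G": ["A"],
}

def pagerank(links, d, alpha=0.85, n_iter=200):
    """Iterate r = alpha * M r + (1 - alpha) * d, jumping via d from dead ends."""
    r = dict(d)
    for _ in range(n_iter):
        # Assumption: probability stuck on dead ends re-enters the graph via d.
        dangling = sum(r[v] for v in links if not links[v])
        nxt = {v: (1 - alpha) * d[v] + alpha * dangling * d[v] for v in links}
        for v, targets in links.items():
            for t in targets:
                nxt[t] += alpha * r[v] / len(targets)
        r = nxt
    return r

uniform = {v: 1 / 7 for v in NODES}                   # normal PageRank
medicine = {"A", "B", "C", "G"}                       # nodes with d = 1/4
topic = {v: (1 / 4 if v in medicine else 0.0) for v in NODES}

normal = pagerank(LINKS, uniform)
tspr = pagerank(LINKS, topic)

for name, scores in (("normal", normal), ("topic-sensitive", tspr)):
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(name, [(v, round(scores[v], 3)) for v in ranking])
```

Only `d` changes between the two runs; the transition structure and the update rule stay the same.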
Results of Topic-sensitive PageRank (TsPR)
49
● TsPR gives high scores to pages about the target topic.
● Even if a page is not about the target topic, TsPR gives it a high score if it is linked from important pages.
[Figure: graph with nodes A, B, C, D, E, F, and G; each node is marked as a page about medicine or a page about cosmetics]
Rank   Normal PageRank   Topic-sensitive PR
1      C: 0.282 pt       A: 0.266 pt
2      A: 0.174 pt       C: 0.248 pt
3      F: 0.133 pt       G: 0.147 pt
4      D: 0.132 pt       B: 0.121 pt
5      B: 0.093 pt       D: 0.108 pt
6      E: 0.092 pt       E: 0.057 pt
7      G: 0.092 pt       F: 0.055 pt
When do we use Topic-sensitive PageRank?
50
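1. Finding important nodes in a graph for target topics
- To do so, give random surf values only to the nodes about the target topics.
2. Finding important nodes for individual users (personalizing PageRank)
- If you know which nodes a user frequently visits, give random surf values only to those nodes.
[Figure: graph with nodes A, B, C, D, E, F, and G; the pages that a user likes are marked]
$$\mathbf{r}_n = \alpha M\mathbf{r}_{n-1} + (1-\alpha)\,\mathbf{d}$$
Random surf vector for the user (node order A, B, C, D, E, F, G):
$$\mathbf{d} = (0,\ 1/3,\ 0,\ 0,\ 1/3,\ 0,\ 1/3)^T$$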
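Building the random surf vector from a set of preferred nodes can be sketched as below. The helper name `teleport_vector` is hypothetical; the node order A to G and the two node sets follow the vectors shown on the slides.

```python
def teleport_vector(nodes, preferred):
    """Spread the random-jump probability evenly over the preferred nodes only."""
    preferred = set(preferred)
    if not preferred or not preferred <= set(nodes):
        raise ValueError("preferred must be a non-empty subset of nodes")
    share = 1 / len(preferred)
    return [share if node in preferred else 0.0 for node in nodes]

nodes = list("ABCDEFG")

# Personalization: the pages a user likes (B, E, G in the slide's vector).
d_user = teleport_vector(nodes, {"B", "E", "G"})

# Topic sensitivity: the nodes that receive 1/4 in the slide's vector.
d_topic = teleport_vector(nodes, {"A", "B", "C", "G"})

print(d_user)
print(d_topic)
```

Either vector can then be used as d in the update rule r_n = αM r_{n-1} + (1 − α)d; nothing else in the algorithm changes.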

Contenu connexe

Tendances

Graph Representation Learning
Graph Representation LearningGraph Representation Learning
Graph Representation LearningJure Leskovec
 
Ranking algorithms
Ranking algorithmsRanking algorithms
Ranking algorithmsAnkit Raj
 
Kohonen self organizing maps
Kohonen self organizing mapsKohonen self organizing maps
Kohonen self organizing mapsraphaelkiminya
 
Advantages and disadvantages of hidden markov model
Advantages and disadvantages of hidden markov modelAdvantages and disadvantages of hidden markov model
Advantages and disadvantages of hidden markov modeljoshiblog
 
I.INFORMED SEARCH IN ARTIFICIAL INTELLIGENCE II. HEURISTIC FUNCTION IN AI III...
I.INFORMED SEARCH IN ARTIFICIAL INTELLIGENCE II. HEURISTIC FUNCTION IN AI III...I.INFORMED SEARCH IN ARTIFICIAL INTELLIGENCE II. HEURISTIC FUNCTION IN AI III...
I.INFORMED SEARCH IN ARTIFICIAL INTELLIGENCE II. HEURISTIC FUNCTION IN AI III...vikas dhakane
 
Bruteforce algorithm
Bruteforce algorithmBruteforce algorithm
Bruteforce algorithmRezwan Siam
 
Link analysis for web search
Link analysis for web searchLink analysis for web search
Link analysis for web searchEmrullah Delibas
 
Unsupervised learning
Unsupervised learningUnsupervised learning
Unsupervised learningamalalhait
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural NetworksDatabricks
 
Machine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural NetworksMachine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural NetworksFrancesco Collova'
 
Implementing page rank algorithm using hadoop map reduce
Implementing page rank algorithm using hadoop map reduceImplementing page rank algorithm using hadoop map reduce
Implementing page rank algorithm using hadoop map reduceFarzan Hajian
 

Tendances (20)

Search Engines
Search EnginesSearch Engines
Search Engines
 
Web spam
Web spamWeb spam
Web spam
 
Kmp
KmpKmp
Kmp
 
Sorting network
Sorting networkSorting network
Sorting network
 
Graph Representation Learning
Graph Representation LearningGraph Representation Learning
Graph Representation Learning
 
Naive bayes
Naive bayesNaive bayes
Naive bayes
 
Ranking algorithms
Ranking algorithmsRanking algorithms
Ranking algorithms
 
Kohonen self organizing maps
Kohonen self organizing mapsKohonen self organizing maps
Kohonen self organizing maps
 
Advantages and disadvantages of hidden markov model
Advantages and disadvantages of hidden markov modelAdvantages and disadvantages of hidden markov model
Advantages and disadvantages of hidden markov model
 
I.INFORMED SEARCH IN ARTIFICIAL INTELLIGENCE II. HEURISTIC FUNCTION IN AI III...
I.INFORMED SEARCH IN ARTIFICIAL INTELLIGENCE II. HEURISTIC FUNCTION IN AI III...I.INFORMED SEARCH IN ARTIFICIAL INTELLIGENCE II. HEURISTIC FUNCTION IN AI III...
I.INFORMED SEARCH IN ARTIFICIAL INTELLIGENCE II. HEURISTIC FUNCTION IN AI III...
 
Bruteforce algorithm
Bruteforce algorithmBruteforce algorithm
Bruteforce algorithm
 
Link analysis for web search
Link analysis for web searchLink analysis for web search
Link analysis for web search
 
Unsupervised learning
Unsupervised learningUnsupervised learning
Unsupervised learning
 
Web Crawlers
Web CrawlersWeb Crawlers
Web Crawlers
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural Networks
 
Machine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural NetworksMachine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural Networks
 
Cloud computing lecture 7
Cloud computing lecture 7Cloud computing lecture 7
Cloud computing lecture 7
 
Divide and conquer
Divide and conquerDivide and conquer
Divide and conquer
 
Implementing page rank algorithm using hadoop map reduce
Implementing page rank algorithm using hadoop map reduceImplementing page rank algorithm using hadoop map reduce
Implementing page rank algorithm using hadoop map reduce
 
A* Search Algorithm
A* Search AlgorithmA* Search Algorithm
A* Search Algorithm
 

Similaire à Link Analysis

Chapter8-Link_Analysis.pptx
Chapter8-Link_Analysis.pptxChapter8-Link_Analysis.pptx
Chapter8-Link_Analysis.pptxAmenahAbbood
 
Chapter8-Link_Analysis (1).pptx
Chapter8-Link_Analysis (1).pptxChapter8-Link_Analysis (1).pptx
Chapter8-Link_Analysis (1).pptxAmenahAbbood
 
page rank explication et exemple formule
page rank explication et exemple  formulepage rank explication et exemple  formule
page rank explication et exemple formuleRamiHarrathi1
 
Markov chains and page rankGraphs.pdf
Markov chains and page rankGraphs.pdfMarkov chains and page rankGraphs.pdf
Markov chains and page rankGraphs.pdfrayyverma
 
Search engine page rank demystification
Search engine page rank demystificationSearch engine page rank demystification
Search engine page rank demystificationRaja R
 
Analysis Of Algorithm
Analysis Of AlgorithmAnalysis Of Algorithm
Analysis Of AlgorithmBashi9675
 
Page rank
Page rankPage rank
Page rankCarlos
 
Optimizing search engines
Optimizing search enginesOptimizing search engines
Optimizing search enginesSwapnil Kotwal
 
Reputation Systems I
Reputation Systems IReputation Systems I
Reputation Systems IYury Lifshits
 
PageRank_algorithm_Nfaoui_El_Habib
PageRank_algorithm_Nfaoui_El_HabibPageRank_algorithm_Nfaoui_El_Habib
PageRank_algorithm_Nfaoui_El_HabibEl Habib NFAOUI
 
Google finalversionfixed
Google finalversionfixedGoogle finalversionfixed
Google finalversionfixedJohnels
 

Similaire à Link Analysis (20)

Chapter8-Link_Analysis.pptx
Chapter8-Link_Analysis.pptxChapter8-Link_Analysis.pptx
Chapter8-Link_Analysis.pptx
 
Chapter8-Link_Analysis (1).pptx
Chapter8-Link_Analysis (1).pptxChapter8-Link_Analysis (1).pptx
Chapter8-Link_Analysis (1).pptx
 
page rank explication et exemple formule
page rank explication et exemple  formulepage rank explication et exemple  formule
page rank explication et exemple formule
 
Markov chains and page rankGraphs.pdf
Markov chains and page rankGraphs.pdfMarkov chains and page rankGraphs.pdf
Markov chains and page rankGraphs.pdf
 
Pagerank
PagerankPagerank
Pagerank
 
Search engine page rank demystification
Search engine page rank demystificationSearch engine page rank demystification
Search engine page rank demystification
 
Analysis Of Algorithm
Analysis Of AlgorithmAnalysis Of Algorithm
Analysis Of Algorithm
 
Page rank
Page rankPage rank
Page rank
 
Optimizing search engines
Optimizing search enginesOptimizing search engines
Optimizing search engines
 
Pagerank
PagerankPagerank
Pagerank
 
Reputation Systems I
Reputation Systems IReputation Systems I
Reputation Systems I
 
PageRank_algorithm_Nfaoui_El_Habib
PageRank_algorithm_Nfaoui_El_HabibPageRank_algorithm_Nfaoui_El_Habib
PageRank_algorithm_Nfaoui_El_Habib
 
Dm page rank
Dm page rankDm page rank
Dm page rank
 
Page rank2
Page rank2Page rank2
Page rank2
 
Google finalversionfixed
Google finalversionfixedGoogle finalversionfixed
Google finalversionfixed
 
Cloud Computing Project
Cloud Computing ProjectCloud Computing Project
Cloud Computing Project
 
Ranking Web Pages
Ranking Web PagesRanking Web Pages
Ranking Web Pages
 
Page rank method
Page rank methodPage rank method
Page rank method
 
Page Rank
Page RankPage Rank
Page Rank
 
Page Rank
Page RankPage Rank
Page Rank
 

Plus de Yusuke Yamamoto

Collaborative Filtering 2: Item-based CF
Collaborative Filtering 2: Item-based CFCollaborative Filtering 2: Item-based CF
Collaborative Filtering 2: Item-based CFYusuke Yamamoto
 
Collaborative Filtering 1: User-based CF
Collaborative Filtering 1: User-based CFCollaborative Filtering 1: User-based CF
Collaborative Filtering 1: User-based CFYusuke Yamamoto
 
データ解析技術2019
データ解析技術2019データ解析技術2019
データ解析技術2019Yusuke Yamamoto
 
研究室紹介資料2019
研究室紹介資料2019研究室紹介資料2019
研究室紹介資料2019Yusuke Yamamoto
 
ACM WebSci 2018 presentation/発表資料
ACM WebSci 2018 presentation/発表資料ACM WebSci 2018 presentation/発表資料
ACM WebSci 2018 presentation/発表資料Yusuke Yamamoto
 
不便益システムシンポジウム2018発表資料
不便益システムシンポジウム2018発表資料不便益システムシンポジウム2018発表資料
不便益システムシンポジウム2018発表資料Yusuke Yamamoto
 
KURA HOUR拡大版・附属図書館研究開発室セミナー 20180319
KURA HOUR拡大版・附属図書館研究開発室セミナー 20180319KURA HOUR拡大版・附属図書館研究開発室セミナー 20180319
KURA HOUR拡大版・附属図書館研究開発室セミナー 20180319Yusuke Yamamoto
 
批判的ウェブ情報探索リテラシー尺度の開発
批判的ウェブ情報探索リテラシー尺度の開発批判的ウェブ情報探索リテラシー尺度の開発
批判的ウェブ情報探索リテラシー尺度の開発Yusuke Yamamoto
 
東北地区大学図書館協議会 第72回総会講演資料20170922
東北地区大学図書館協議会 第72回総会講演資料20170922東北地区大学図書館協議会 第72回総会講演資料20170922
東北地区大学図書館協議会 第72回総会講演資料20170922Yusuke Yamamoto
 
WI2研究会 Vol.10発表資料20170708
WI2研究会 Vol.10発表資料20170708WI2研究会 Vol.10発表資料20170708
WI2研究会 Vol.10発表資料20170708Yusuke Yamamoto
 
情報学応用論20170622
情報学応用論20170622情報学応用論20170622
情報学応用論20170622Yusuke Yamamoto
 
ビッグデータとITイノベーション
ビッグデータとITイノベーションビッグデータとITイノベーション
ビッグデータとITイノベーションYusuke Yamamoto
 
ウェブと研究者との関わり方20150302
ウェブと研究者との関わり方20150302ウェブと研究者との関わり方20150302
ウェブと研究者との関わり方20150302Yusuke Yamamoto
 
大学の研究力を考える
大学の研究力を考える大学の研究力を考える
大学の研究力を考えるYusuke Yamamoto
 
研究力DOWNシナリオ
研究力DOWNシナリオ研究力DOWNシナリオ
研究力DOWNシナリオYusuke Yamamoto
 
URAかるた 〜URA業務の理解・共有を促進するゲーム教材
URAかるた 〜URA業務の理解・共有を促進するゲーム教材URAかるた 〜URA業務の理解・共有を促進するゲーム教材
URAかるた 〜URA業務の理解・共有を促進するゲーム教材Yusuke Yamamoto
 

Plus de Yusuke Yamamoto (20)

WISE2019 presentation
WISE2019 presentationWISE2019 presentation
WISE2019 presentation
 
Matrix Factorization
Matrix FactorizationMatrix Factorization
Matrix Factorization
 
Collaborative Filtering 2: Item-based CF
Collaborative Filtering 2: Item-based CFCollaborative Filtering 2: Item-based CF
Collaborative Filtering 2: Item-based CF
 
Collaborative Filtering 1: User-based CF
Collaborative Filtering 1: User-based CFCollaborative Filtering 1: User-based CF
Collaborative Filtering 1: User-based CF
 
データ解析技術2019
データ解析技術2019データ解析技術2019
データ解析技術2019
 
研究室紹介資料2019
研究室紹介資料2019研究室紹介資料2019
研究室紹介資料2019
 
ACM WebSci 2018 presentation/発表資料
ACM WebSci 2018 presentation/発表資料ACM WebSci 2018 presentation/発表資料
ACM WebSci 2018 presentation/発表資料
 
不便益システムシンポジウム2018発表資料
不便益システムシンポジウム2018発表資料不便益システムシンポジウム2018発表資料
不便益システムシンポジウム2018発表資料
 
KURA HOUR拡大版・附属図書館研究開発室セミナー 20180319
KURA HOUR拡大版・附属図書館研究開発室セミナー 20180319KURA HOUR拡大版・附属図書館研究開発室セミナー 20180319
KURA HOUR拡大版・附属図書館研究開発室セミナー 20180319
 
批判的ウェブ情報探索リテラシー尺度の開発
批判的ウェブ情報探索リテラシー尺度の開発批判的ウェブ情報探索リテラシー尺度の開発
批判的ウェブ情報探索リテラシー尺度の開発
 
東北地区大学図書館協議会 第72回総会講演資料20170922
東北地区大学図書館協議会 第72回総会講演資料20170922東北地区大学図書館協議会 第72回総会講演資料20170922
東北地区大学図書館協議会 第72回総会講演資料20170922
 
WI2研究会 Vol.10発表資料20170708
WI2研究会 Vol.10発表資料20170708WI2研究会 Vol.10発表資料20170708
WI2研究会 Vol.10発表資料20170708
 
情報学応用論20170622
情報学応用論20170622情報学応用論20170622
情報学応用論20170622
 
情報学総論20170623
情報学総論20170623情報学総論20170623
情報学総論20170623
 
情報学総論20170616
情報学総論20170616情報学総論20170616
情報学総論20170616
 
ビッグデータとITイノベーション
ビッグデータとITイノベーションビッグデータとITイノベーション
ビッグデータとITイノベーション
 
ウェブと研究者との関わり方20150302
ウェブと研究者との関わり方20150302ウェブと研究者との関わり方20150302
ウェブと研究者との関わり方20150302
 
大学の研究力を考える
大学の研究力を考える大学の研究力を考える
大学の研究力を考える
 
研究力DOWNシナリオ
研究力DOWNシナリオ研究力DOWNシナリオ
研究力DOWNシナリオ
 
URAかるた 〜URA業務の理解・共有を促進するゲーム教材
URAかるた 〜URA業務の理解・共有を促進するゲーム教材URAかるた 〜URA業務の理解・共有を促進するゲーム教材
URAかるた 〜URA業務の理解・共有を促進するゲーム教材
 

Dernier

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 

Dernier (20)

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT

Link Analysis

  • 1. Link Analysis: Find important nodes in large-scale networks. Yusuke Yamamoto, Lecturer, Faculty of Informatics (yusuke_yamamoto@acm.org). Data Engineering (Recommender Systems 4), 2019.11.18.
  • 2. Graph data. A graph is a data structure consisting of a collection of nodes and edges (links). Each edge represents the relation that exists between two nodes.
  • 3. Graph data is often observed in real life. Examples: paper citation networks, the Web. (Image from William L. Hamilton’s COMP551 special topic lecture.)
  • 4. Important nodes in graphs. We often want to know which nodes in a graph are important. Who is the most influential person? Which is the best paper? Which is the most popular webpage? Examples: paper citation networks, the Web. (Image from William L. Hamilton’s COMP551 special topic lecture.)
  • 5. Important nodes in graphs. We often want to know which nodes in a graph are important. Who is the most influential person? Which is the best paper? Which is the most popular webpage? Q. How can we compute the importance of nodes in a graph? A. Link analysis can help! (Image from William L. Hamilton’s COMP551 special topic lecture.)
  • 6. What do we learn today? 1. PageRank. 2. Topic-sensitive PageRank.
  • 7. Part 1: PageRank. Google introduced a new method to evaluate webpages.
  • 8. The objective of PageRank. Based on graph structure, PageRank evaluates and ranks webpages. Web graph (hyperlink structure) with nodes A, B, C, D, E. Importance ranking: 1. node B (0.40pt), 2. node D (0.26pt), 3. node A (0.20pt), 4. node C (0.11pt), 5. node E (0.03pt).
  • 9. Simple method to evaluate webpage importance. Simple assumption (majority voting): if a webpage is linked to by many webpages, it is likely to be important. Example graph with nodes A to E: B has #in-links = 3; C and D each have #in-links = 2.
  • 10. Simple method to evaluate webpage importance. Simple assumption (majority voting): if a webpage is linked to by many webpages, it is likely to be important. Example graph with nodes A to E: B has #in-links = 3; C and D each have #in-links = 2. Is this assumption good enough?
  • 11. Problems with simple link counting (1/2). Malicious websites can easily boost their scores by creating a ‘spam farm’ of a million pages. Example graph with nodes A to E: one page has #in-links: 2.
  • 12. Problems with simple link counting (1/2). Malicious websites can easily boost their scores by creating a ‘spam farm’ of a million pages. Example graph with nodes A to E: when a spam farm of 98 pages (marked M) links to the page, its count changes from #in-links: 2 to #in-links: 100.
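The spam-farm effect on plain in-link counting can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the page names `T`, `A`, `B`, and `M0`…`M97` and the honest edges are hypothetical, chosen only so that the target page goes from 2 in-links to 100 as on the slide.

```python
from collections import Counter


def in_link_counts(edges):
    """Count in-links per node from a list of (source, target) edges."""
    counts = Counter()
    for source, target in edges:
        counts[source] += 0  # register the source so nodes without in-links appear
        counts[target] += 1
    return dict(counts)


# Hypothetical honest links: the target page "T" has two in-links.
honest_edges = [("A", "T"), ("B", "T"), ("T", "A")]
before = in_link_counts(honest_edges)

# Hypothetical spam farm: 98 pages that each link to the target page.
spam_edges = [(f"M{i}", "T") for i in range(98)]
after = in_link_counts(honest_edges + spam_edges)

print("in-links of T:", before["T"], "=>", after["T"])
```

Because every in-link counts the same regardless of who casts it, the 98 farm pages (which nobody links to) are enough to inflate the score.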
  • 13. Problems with simple link counting (2/2). The simple method does not consider whether a webpage is linked to by important pages or by unimportant pages. In the example graph (nodes A to E), B has #in-links: 3, and C and D each have #in-links: 2. D is linked by B, whose #in-links = 3; C is linked by E, whose #in-links = 0. Which is more important, node C or D?
  • 14. Basic idea of PageRank. Assumption: if a page is linked to by many IMPORTANT pages, the page is likely to be important. C and D both have #in-links: 2, but B is more important than E. D is therefore more important than C, because D is linked by a more important node (B) than C is.
  • 15. Another interpretation of the basic idea of PageRank: people are more likely to visit more important pages. 1. When people are browsing a page, we assume that they randomly select one of its links to decide which page to browse next. 2. By following links, people are more likely to arrive at a page from more important pages than from less important ones. 3. In the example graph (nodes A to E), one node has the highest chance of being visited. How can we calculate the likelihood of a visit?
  • 16. Toy example to check the basic idea of PageRank. Graph with nodes A, B, C, D. Q. Suppose that a random surfer is now at A. On each page, he randomly selects one of its links to decide which page to visit next. Which page does he have the highest chance of (re-)visiting? Initial probabilities: A = 1, B = 0, C = 0, D = 0.
  • 17. Toy example to check the basic idea of PageRank. Q. Which page does he have the highest chance of (re-)visiting? The surfer randomly selects a link to follow. Transition probabilities from A: 1/3 to each of the three linked nodes.
  • 18. Toy example to check the basic idea of PageRank. Q. Which page does he have the highest chance of (re-)visiting? Transition probabilities from A: 1/3 on each link. What are the chances that he will be on node B or C after his first transition?
  • 19. Toy example to check the basic idea of PageRank. Q. Which page does he have the highest chance of (re-)visiting? After the first transition: A = 0, and each of B, C, and D has $1 \times (1/3) = 1/3$.
  • 20. Toy example to check the basic idea of PageRank. Q. Which page does he have the highest chance of (re-)visiting? After the first transition: A = 0, and each of B, C, and D has $1 \times (1/3) = 1/3$. To which node will he move next?
  • 21. Toy example to check the basic idea of PageRank. Q. Which page does he have the highest chance of (re-)visiting? Transition probabilities on the links leaving the other nodes: four links with probability 1/2 and one link with probability 1.
  • 22. Toy example to check the basic idea of PageRank. Q. Which page does he have the highest chance of (re-)visiting? Transition probabilities on the links leaving the other nodes: four links with probability 1/2 and one link with probability 1. What are the chances that he will be on each node after two transitions?
  • 23. Toy example to check the basic idea of PageRank. Q. Which page does he have the highest chance of (re-)visiting? Probabilities before the second transition: A = 0, and 1/3 for each of B, C, and D. Transition probabilities on the links leaving those nodes: four links with probability 1/2 and one link with probability 1.
  • 24. Toy example to check the basic idea of PageRank. Q. Which page does he have the highest chance of (re-)visiting? After the second transition, A has $\frac{1}{3} \times \frac{1}{2} + \frac{1}{3} \times 1 = \frac{1}{2}$, and the other three nodes have $\frac{1}{3} \times \frac{1}{2} + \frac{1}{3} \times 0 = \frac{1}{6}$, $\frac{1}{3} \times \frac{1}{2} + 0 \times \frac{1}{3} = \frac{1}{6}$, and $0 \times \frac{1}{3} + \frac{1}{3} \times \frac{1}{2} = \frac{1}{6}$.
  • 25. Toy example to check the basic idea of PageRank. Q. Which page does he have the highest chance of (re-)visiting? Probability change in each iteration:
      Iter.   0      1      2      3      4      5
      A       1      0      0.5    0.25   0.375  0.313
      B       0      0.333  0.167  0.25   0.208  0.229
      C       0      0.333  0.167  0.25   0.208  0.229
      D       0      0.333  0.167  0.25   0.208  0.229
  • 26. Toy example to check the basic idea of PageRank. Q. Which page does he have the highest chance of (re-)visiting? Probability change in each iteration:
      Iter.   0      5      10     20     …   1000
      A       1      0.313  0.334  0.333      0.333
      B       0      0.229  0.222  0.222      0.222
      C       0      0.229  0.222  0.222      0.222
      D       0      0.229  0.222  0.222      0.222
    When the transitions are repeated, each probability converges. The converged probabilities represent the likelihood that people visit each page (i.e., PageRank).
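The iteration table can be reproduced by pushing probability mass along the links step by step. The sketch below is only an illustration: the slides do not list the edges explicitly, so the out-link lists for B, C, and D are an assumption inferred from the transition probabilities and the worked arithmetic (A links to B, C, D; B to A and D; C to A; D to B and C).

```python
# Out-links of the toy graph. The edges of B, C and D are assumed
# (inferred from the slide's numbers), not stated in the slides.
out_links = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "C"],
}


def step(prob):
    """One transition: each node splits its probability evenly over its out-links."""
    nxt = {node: 0.0 for node in out_links}
    for node, targets in out_links.items():
        share = prob[node] / len(targets)
        for target in targets:
            nxt[target] += share
    return nxt


prob = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}  # the surfer starts at A
history = [prob]
for _ in range(1000):
    prob = step(prob)
    history.append(prob)

for n in (0, 1, 2, 3, 4, 5, 10, 20, 1000):
    print(n, {node: round(p, 3) for node, p in history[n].items()})
```

Under these assumed edges the probabilities settle at 1/3 for A and 2/9 for each of B, C, and D, which matches the converged column of the table.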
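  • 27. Mathematical procedure to calculate simple PageRank (1/4). Initial probability of being on each node: $\mathbf{r}_0 = (1, 0, 0, 0)^\top$, i.e., Prob. = 1 for A and Prob. = 0 for B, C, and D. Transition probability from node to node, with entry $M_{ij}$ the probability of moving from node $j$ to node $i$ (node order A, B, C, D): $\mathbf{M} = \begin{pmatrix} 0 & 1/2 & 1 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix}$. Each column sums to 1.
  • 28. Mathematical procedure to calculate simple PageRank (2/4). $\mathbf{r}_1 = \mathbf{M}\mathbf{r}_0 = \begin{pmatrix} 0 & 1/2 & 1 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}$
  • 29. Mathematical procedure to calculate simple PageRank (3/4). $\mathbf{r}_2 = \mathbf{M}\mathbf{r}_1 = \mathbf{M}\mathbf{M}\mathbf{r}_0 = \mathbf{M}^2\mathbf{r}_0 = \begin{pmatrix} 0 & 1/2 & 1 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix}^{2} \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 1/2 \\ 1/6 \\ 1/6 \\ 1/6 \end{pmatrix}$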
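  • 30. Mathematical procedure to calculate simple PageRank (4/4). $\mathbf{r}_n = \mathbf{M}\mathbf{r}_{n-1} = \mathbf{M}\mathbf{M}\mathbf{r}_{n-2} = \mathbf{M}^2\mathbf{r}_{n-2} = \dots = \mathbf{M}^n\mathbf{r}_0$. If $n$ is large enough or $\mathbf{r}_n$ has converged, we regard $\mathbf{r}_n$ as the likelihood that people visit each page.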
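The matrix form can be checked with a short power-iteration sketch. It assumes a column-stochastic matrix for the toy graph (entry `M[i][j]` is the probability of moving from node j to node i, node order A, B, C, D) built from the same assumed edges as the toy example; the helper names `mat_vec` and `power_iterate` and the L1 stopping tolerance are illustrative choices, not part of the slides.

```python
from fractions import Fraction as F

# Column-stochastic transition matrix of the toy graph (assumed edges):
# M[i][j] = probability of moving from node j to node i; order A, B, C, D.
M = [
    [F(0),    F(1, 2), F(1), F(0)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(1, 2), F(0), F(0)],
]


def mat_vec(matrix, vec):
    """Matrix-vector product written with plain lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def power_iterate(matrix, r0, tol=1e-9, max_iter=10_000):
    """Repeat r_n = M r_{n-1} until the L1 change drops below tol."""
    r = list(r0)
    for n in range(1, max_iter + 1):
        nxt = mat_vec(matrix, r)
        if sum(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt, n
        r = nxt
    return r, max_iter


# Exact first two steps, starting from the surfer at A.
r0 = [F(1), F(0), F(0), F(0)]
r1 = mat_vec(M, r0)
r2 = mat_vec(M, r1)
print([str(x) for x in r1], [str(x) for x in r2])

# Floating-point iteration until convergence.
r_conv, n_steps = power_iterate(M, [1.0, 0.0, 0.0, 0.0])
print(n_steps, [round(x, 3) for x in r_conv])
```

Exact fractions are used only for the first two steps so they can be compared with the hand calculation; the convergence loop switches to floats to keep the numbers small.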
  • 31. Problems of simple PageRank (1/3). Several link structures violate the PageRank assumption: dead ends and spider traps (each shown on a graph with nodes A, B, C, D).
  • 32. Problems of simple PageRank (2/3). Several link structures violate the PageRank assumption. Dead end: probability change in each iteration:
      Iter.   0      1      10     100
      A       1      0      0.01   0
      B       0      0.333  0.015  0
      C       0      0.333  0.015  0
      D       0      0.333  0.015  0
  • 33. Problems of simple PageRank (3/3). Several link structures violate the PageRank assumption. Spider trap: probability change in each iteration:
      Iter.   0      1      10     100
      A       1      0      0.01   0
      B       0      0.333  0.015  0
      C       0      0.333  0.961  1
      D       0      0.333  0.015  0
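Both failure modes are easy to reproduce with the same propagation step. In this sketch the graphs are assumptions modeled on the toy example: the dead-end variant gives C no out-links, and the spider-trap variant makes C link only to itself; the function name `propagate` is likewise illustrative.

```python
def propagate(out_links, start, steps):
    """Simple PageRank: push probability along out-links for a number of steps."""
    prob = {node: 0.0 for node in out_links}
    prob[start] = 1.0
    for _ in range(steps):
        nxt = {node: 0.0 for node in out_links}
        for node, targets in out_links.items():
            # A dead end has no targets, so its probability mass is simply lost.
            for target in targets:
                nxt[target] += prob[node] / len(targets)
        prob = nxt
    return prob


# Assumed variants of the toy graph.
dead_end = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": [], "D": ["B", "C"]}
spider_trap = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["C"], "D": ["B", "C"]}

for name, graph in (("dead end", dead_end), ("spider trap", spider_trap)):
    for n in (0, 1, 10, 100):
        result = propagate(graph, "A", n)
        print(name, n, {node: round(p, 3) for node, p in result.items()})
```

With the dead end, the total probability leaks away until every score is 0; with the spider trap, the total stays 1 but C absorbs all of it.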
  • 34. Revision of the PageRank assumption (complete PageRank). 1. When people are browsing a page, we assume that they randomly select one of its links to decide which page to browse next. 2. Sometimes, people visit pages directly or at random without using hyperlinks (called a random jump). In most cases, people use links; sometimes, people jump directly to a page.
  • 35. Algorithm of complete PageRank (1/5). 1. Initialize $\mathbf{r}_0$ (assign random values to $\mathbf{r}_0$). Set the transition matrix $\mathbf{M}$ and the random surf vector $\mathbf{d}$. 2. Starting from $\mathbf{r}_0$, update $\mathbf{r}_n$ with the formula $\mathbf{r}_n = \alpha\mathbf{M}\mathbf{r}_{n-1} + (1 - \alpha)\mathbf{d}$. The first term corresponds to the case where people use links to visit pages; the second term corresponds to the case where people visit pages directly.
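  • 36. Algorithm of complete PageRank (2/5). 1. Initialize $\mathbf{r}_0$ (assign random values to $\mathbf{r}_0$). Set the transition matrix $\mathbf{M}$ and the random surf vector $\mathbf{d}$. 2. Starting from $\mathbf{r}_0$, update $\mathbf{r}_n$ with the formula $\mathbf{r}_n = \alpha\mathbf{M}\mathbf{r}_{n-1} + (1 - \alpha)\mathbf{d}$. $\mathbf{M}$ is the transition matrix, which is derived from the link structure. For the example graph (node order A, B, C, D, with entry $M_{ij}$ the probability of moving from node $j$ to node $i$): $\mathbf{M} = \begin{pmatrix} 0 & 1/2 & 0 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix}$. The column for C is all zeros because C has no out-links in this graph.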
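  • 37. Algorithm of complete PageRank (3/5). 1. Initialize $\mathbf{r}_0$ (assign random values to $\mathbf{r}_0$). Set the transition matrix $\mathbf{M}$ and the random surf vector $\mathbf{d}$. 2. Starting from $\mathbf{r}_0$, update $\mathbf{r}_n$ with the formula $\mathbf{r}_n = \alpha\mathbf{M}\mathbf{r}_{n-1} + (1 - \alpha)\mathbf{d}$. $\mathbf{d}$ is the random surf vector, which gives the probability that people visit each page directly (a uniform distribution): $\mathbf{d} = (1/4, 1/4, 1/4, 1/4)^\top$ for nodes A, B, C, D.
  • 38. Algorithm of complete PageRank (4/5). 1. Initialize $\mathbf{r}_0$ (assign random values to $\mathbf{r}_0$). Set the transition matrix $\mathbf{M}$ and the random surf vector $\mathbf{d}$. 2. Starting from $\mathbf{r}_0$, update $\mathbf{r}_n$ with the formula $\mathbf{r}_n = \alpha\mathbf{M}\mathbf{r}_{n-1} + (1 - \alpha)\mathbf{d}$. $\alpha$ and $1 - \alpha$ are the probabilities (parameters) that decide which of the two modes people use. (Empirically, $\alpha$ is set in the range 0.8 to 0.9.)
  • 39. Algorithm of complete PageRank (5/5). 1. Initialize $\mathbf{r}_0$ (assign random values to $\mathbf{r}_0$). Set the transition matrix $\mathbf{M}$ and the random surf vector $\mathbf{d}$. 2. Starting from $\mathbf{r}_0$, update $\mathbf{r}_n$ with the formula $\mathbf{r}_n = \alpha\mathbf{M}\mathbf{r}_{n-1} + (1 - \alpha)\mathbf{d}$. 3. If $\mathbf{r}_n$ has converged (it no longer changes), the algorithm finishes. The converged $\mathbf{r}_n$ is the PageRank!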
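The three steps can be sketched as a small function. This is a minimal illustration under stated assumptions rather than a reference implementation: the graph is the spider-trap variant of the toy example (C links only to itself), the matrix is column-stochastic with node order A, B, C, D, and `alpha = 0.8` is a choice from the slide's 0.8 to 0.9 range. The function name `pagerank`, the `r0` parameter, and the L1 tolerance are illustrative.

```python
def pagerank(M, d, alpha=0.8, r0=None, tol=1e-10, max_iter=1000):
    """Iterate r_n = alpha * M r_{n-1} + (1 - alpha) * d until r_n stops changing."""
    n = len(d)
    r = list(r0) if r0 is not None else [1.0 / n] * n  # step 1: initialize r_0
    for _ in range(max_iter):
        # Step 2: link-following term plus random-jump term.
        nxt = [
            alpha * sum(M[i][j] * r[j] for j in range(n)) + (1 - alpha) * d[i]
            for i in range(n)
        ]
        # Step 3: stop once the vector has converged.
        if sum(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r


# Assumed spider-trap graph: M[i][j] = probability of moving from j to i.
M_trap = [
    [0.0, 1 / 2, 0.0, 0.0],
    [1 / 3, 0.0, 0.0, 1 / 2],
    [1 / 3, 0.0, 1.0, 1 / 2],
    [1 / 3, 1 / 2, 0.0, 0.0],
]
d_uniform = [1 / 4] * 4

ranks = pagerank(M_trap, d_uniform, alpha=0.8, r0=[1.0, 0.0, 0.0, 0.0])
print(dict(zip("ABCD", (round(x, 3) for x in ranks))))
```

Because the random jump returns a fixed share of probability to every node, C can no longer absorb everything: it keeps the largest score, but A, B, and D stay above zero.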
  • 40. Simple PageRank vs. complete PageRank. Spider-trap graph with nodes A, B, C, D. Probability change in each iteration.
    Simple PageRank:
      Iter.   0      1      10     100
      A       1      0      0.01   0
      B       0      0.333  0.015  0
      C       0      0.333  0.961  1
      D       0      0.333  0.015  0
    Complete PageRank:
      Iter.   0      1      10     100
      A       1      0.05   0.102  0.101
      B       0      0.316  0.129  0.128
      C       0      0.316  0.639  0.642
      D       0      0.316  0.129  0.128
  • 41. What can PageRank provide us? PageRank can evaluate the centrality of nodes in graph (network) data: influential people, good papers to cite (paper citation networks), and popular webpages (the Web).
  • 43. Issues of normal PageRank. Normal PageRank ignores which topics each node is related to. Example graph with nodes A to G, each marked as a page about medicine or a page about cosmetics. Normal PageRank: 1. C 0.282pt, 2. A 0.174pt, 3. F 0.133pt, 4. D 0.132pt, 5. B 0.093pt, 6. E 0.092pt, 7. G 0.092pt. Is page C important from the viewpoint of medicine?
  • 44. Which node is the most important about medicine? In the example graph (nodes A to G, marked as pages about medicine or pages about cosmetics), many pages link to C, but only one of them is about medicine. A is linked to by more pages about medicine than C is.
  • 45. Issues of normal PageRank. Normal PageRank ignores which topics each node is related to. Normal PageRank: 1. C 0.282pt, 2. A 0.174pt, 3. F 0.133pt, 4. D 0.132pt, 5. B 0.093pt, 6. E 0.092pt, 7. G 0.092pt. Is page C important from the viewpoint of medicine? We sometimes want to find important pages (nodes) about a certain topic. If people often move to a page from important pages about the topic, that page should be important for the topic!
  • 46. Assumption of Topic-sensitive PageRank. Normal PageRank: ● People follow links in pages to visit other pages. ● They sometimes randomly visit any kind of page without using links. Topic-sensitive PageRank: ● People follow links in pages to visit other pages. ● They sometimes randomly visit only a certain kind of page without using links.
  • 47. Algorithm of Topic-sensitive PageRank (1/2). Starting from $\mathbf{r}_0$, update $\mathbf{r}_n$ with the formula $\mathbf{r}_n = \alpha\mathbf{M}\mathbf{r}_{n-1} + (1 - \alpha)\mathbf{d}$. Transition matrix for the example graph (node order A, B, C, D, E, F, G, with entry $M_{ij}$ the probability of moving from node $j$ to node $i$): $\mathbf{M} = \begin{pmatrix} 0 & 1/2 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1/3 & 0 & 0 & 0 \\ 1/4 & 0 & 0 & 1/3 & 1/2 & 1 & 0 \\ 1/4 & 1/2 & 0 & 0 & 0 & 0 & 0 \\ 1/4 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1/3 & 1/2 & 0 & 0 \\ 1/4 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}$
  • 48. Algorithm of Topic-sensitive PageRank (2/2). Starting from $\mathbf{r}_0$, update $\mathbf{r}_n$ with the formula $\mathbf{r}_n = \alpha\mathbf{M}\mathbf{r}_{n-1} + (1 - \alpha)\mathbf{d}$. Random surf vector for nodes A to G: Normal PageRank uses $\mathbf{d} = (1/7, 1/7, 1/7, 1/7, 1/7, 1/7, 1/7)^\top$; Topic-sensitive PageRank uses $\mathbf{d} = (1/4, 1/4, 1/4, 0, 0, 0, 1/4)^\top$.
  • 49. Results of Topic-sensitive PageRank (TsPR). Example graph with nodes A to G, marked as pages about medicine or pages about cosmetics. Normal PageRank: 1. C 0.282pt, 2. A 0.174pt, 3. F 0.133pt, 4. D 0.132pt, 5. B 0.093pt, 6. E 0.092pt, 7. G 0.092pt. Topic-sensitive PR: 1. A 0.266pt, 2. C 0.248pt, 3. G 0.147pt, 4. B 0.121pt, 5. D 0.108pt, 6. E 0.057pt, 7. F 0.055pt. ● TsPR gives high scores to pages about the target topics. ● Even if a page is not about the target topics, TsPR gives it a high score if it is linked to by important pages.
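The two rankings can be compared with one function and two random surf vectors. The sketch below makes two assumptions that the slides do not state: `alpha = 0.85` (inside the quoted 0.8 to 0.9 range) and a final normalization so that the scores sum to 1, which is one way to handle the dead end at C. Both were chosen because they come close to the slide's scores; the name `pagerank_scores` is illustrative.

```python
NODES = ["A", "B", "C", "D", "E", "F", "G"]

# Transition matrix from the slide: M[i][j] = probability of moving from j to i.
M = [
    [0, 1 / 2, 0, 0, 0, 0, 1],
    [0, 0, 0, 1 / 3, 0, 0, 0],
    [1 / 4, 0, 0, 1 / 3, 1 / 2, 1, 0],
    [1 / 4, 1 / 2, 0, 0, 0, 0, 0],
    [1 / 4, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1 / 3, 1 / 2, 0, 0],
    [1 / 4, 0, 0, 0, 0, 0, 0],
]


def pagerank_scores(M, d, alpha=0.85, tol=1e-12, max_iter=1000):
    """Iterate r_n = alpha * M r_{n-1} + (1 - alpha) * d, then normalize to sum 1."""
    n = len(d)
    r = list(d)
    for _ in range(max_iter):
        nxt = [
            alpha * sum(M[i][j] * r[j] for j in range(n)) + (1 - alpha) * d[i]
            for i in range(n)
        ]
        converged = sum(abs(a - b) for a, b in zip(nxt, r)) < tol
        r = nxt
        if converged:
            break
    total = sum(r)  # assumed handling of the mass lost at the dead end C
    return {node: value / total for node, value in zip(NODES, r)}


d_normal = [1 / 7] * 7
d_medicine = [1 / 4, 1 / 4, 1 / 4, 0, 0, 0, 1 / 4]  # from the slide

normal = pagerank_scores(M, d_normal)
topic = pagerank_scores(M, d_medicine)

for label, scores in (("Normal PageRank", normal), ("Topic-sensitive PR", topic)):
    ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    print(label, [(node, round(score, 3)) for node, score in ranking])
```

Only the vector `d` differs between the two calls, which is the whole change from normal to Topic-sensitive PageRank.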
  • 50. When do we use Topic-sensitive PageRank? 1. Finding important nodes in a graph for target topics: give random surf values only to the nodes about the target topics. 2. Finding important nodes for individual users (personalizing PageRank): if you know the nodes that a user frequently visits, give random surf values only to those nodes. Example with nodes A to G, where the pages a user likes are marked: in $\mathbf{r}_n = \alpha\mathbf{M}\mathbf{r}_{n-1} + (1 - \alpha)\mathbf{d}$, use $\mathbf{d} = (0, 1/3, 0, 0, 1/3, 0, 1/3)^\top$.
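Building the random surf vector from a set of preferred nodes is a one-liner in practice. The helper below is a hypothetical sketch (the name `random_surf_vector` is not from the slides); it spreads the random-jump probability uniformly over the chosen nodes and gives 0 to all others, as in both use cases above.

```python
def random_surf_vector(nodes, preferred):
    """Uniform random-jump probabilities over the preferred nodes, 0 elsewhere."""
    preferred = set(preferred)
    unknown = preferred - set(nodes)
    if not preferred or unknown:
        raise ValueError(f"preferred nodes must be a non-empty subset of nodes: {sorted(unknown)}")
    weight = 1.0 / len(preferred)
    return [weight if node in preferred else 0.0 for node in nodes]


nodes = ["A", "B", "C", "D", "E", "F", "G"]

# Use case 1: nodes about the target topic (the medicine pages of the example).
d_topic = random_surf_vector(nodes, {"A", "B", "C", "G"})

# Use case 2: pages a particular user likes (personalized PageRank).
d_user = random_surf_vector(nodes, {"B", "E", "G"})

print(d_topic)
print(d_user)
```

The resulting vector can be passed as `d` to any implementation of the update formula; nothing else in the algorithm changes.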