1. Network of Movie Stars
Social Network Analysis
Programming Project
moogway@outlook.com
Copyright: Copyrighted under Creative Commons License (Maybe. Am not sure how that works exactly): Please ask before you
use it. Even if you don’t ask, just attribute it. I know I won’t be able to do anything if you don’t attribute it. Except well maybe you
will step in a puddle while heading to an important meeting. Think about that.
2. I – Purpose of The Study
• To develop insight into the community of movie actors/directors by analyzing their collaborations and
audiences’ reactions to these collaborations (by collaborations, I mean movies, but “collaborations” sounds
more scientific).
II – Methodology
• Analyzing the data for all actors and directors would have been too heavy a computation, so as a proof
of concept only the actors/directors who feature in the IMDB Top 1000 Movies list
(sourced from http://www.icheckmovies.com/lists/imdb+top+1000/lampadatriste/) were kept as subjects.
• That page was saved as an HTML file, and a Python parsing script was used to go through
the page and pick up the IMDB links for the top 1000 movie pages (code not included: I ran out of
time trying to format/comment all the files, thus I could not submit the project, which was depressing. But let me
know if you need it and I will send it across. It’s 14 lines of code, including the empty lines).
• The links were then scraped using a Python scraping script (using BeautifulSoup) to collect the following data
(code included):
• Director(s)
• 3 Main Stars (mentioned separately on IMDB movie page)
• Year of Release
• Movie Rating
• The data was stored in a csv file which was cleaned up manually (a bit) and then analyzed using R and written
into GML (source code included)
• The GML was then loaded up in Gephi to visualize and analyze the network (results included)
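The link-collection step described above could be sketched roughly as follows. This is not the author's 14-line script (which is not included); it is a minimal reconstruction that assumes the saved page links to movies with hrefs containing "imdb.com/title/tt...". The class name, function name, regular expression and sample HTML are all hypothetical, and the real page markup may differ.

```python
import re
from html.parser import HTMLParser

# Assumed shape of an IMDB title link; not taken from the author's script.
IMDB_TITLE_RE = re.compile(r"imdb\.com/title/(tt\d+)")


class ImdbLinkCollector(HTMLParser):
    """Collect unique IMDB title ids found in the href of <a> tags."""

    def __init__(self):
        super().__init__()
        self.title_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = IMDB_TITLE_RE.search(href)
        # Keep the first occurrence only, preserving page order.
        if match and match.group(1) not in self.title_ids:
            self.title_ids.append(match.group(1))


def extract_imdb_ids(html_text):
    """Return the IMDB title ids linked from an HTML document, in page order."""
    collector = ImdbLinkCollector()
    collector.feed(html_text)
    return collector.title_ids


# Hand-written sample; the saved list page's real markup is an assumption.
sample = """
<ul>
  <li><a href="http://www.imdb.com/title/tt0111161/">The Shawshank Redemption</a></li>
  <li><a href="http://www.imdb.com/title/tt0068646/">The Godfather</a></li>
  <li><a href="http://www.imdb.com/title/tt0111161/">duplicate link</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""
print(extract_imdb_ids(sample))
```

Each collected id can then be turned back into a movie page URL for the scraping step.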
3. III – Defining Nodes, Edges, Edgeweights
Nodes:
• Each unique director or actor is a node (and shall be referred to as a node henceforth)
• Node attributes:
• Id – Node Id
• Name – Name of Actor/Director
• Appcount – Number of movies in the list involving a particular node
• Skillscore – This is dependent on rating of each movie in which a node appears
Edges:
• All the nodes who have worked together in a movie get an edge between them.
• Edge Attributes:
• Source
• Target
• Count – Number of times each edge appears
• Edge weight – There are two ways to calculate edge weights, depending on what one wants to
understand
• Weight depending on Skillscores of involved nodes
• Weight depending on the quality of the partnership between two nodes (comes from
the movie rating)
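The node and edge definitions above can be illustrated with a short sketch. It assumes each movie is given as a list of names (director(s) plus the three main stars); the function name and the placeholder names are hypothetical, and the original analysis was done in R, not with this code.

```python
from collections import Counter
from itertools import combinations


def build_network(movies):
    """Return (appcount, edge_count) for a list of per-movie name lists.

    appcount[name]     -> number of movies in the list involving that node
    edge_count[(a, b)] -> number of movies in which a and b both appear
    """
    appcount = Counter()
    edge_count = Counter()
    for people in movies:
        # A person who both directs and acts in a movie is still one node.
        unique_people = sorted(set(people))
        appcount.update(unique_people)
        # Everyone in the same movie is connected to everyone else (a clique).
        for a, b in combinations(unique_people, 2):
            edge_count[(a, b)] += 1
    return appcount, edge_count


# Placeholder names, not real data.
movies = [
    ["Director 1", "Actor A", "Actor B", "Actor C"],
    ["Director 1", "Actor A", "Actor D", "Actor E"],
]
appcount, edge_count = build_network(movies)
print(appcount["Director 1"], edge_count[("Actor A", "Director 1")])
```

Sorting the names before pairing them means the undirected edge between two people always gets the same key, whichever order they were listed in.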
4. III – Skillscore and Edge weights
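R code to help build the rubric:
#INCLUDE THE REQUIRED LIBRARY
library(lattice)
#READ THE DATAFILE AND ASSIGN THE RATINGS TO A NUMERIC VECTOR
csv_path <- "<CSV FILE PATH>"  #REPLACE WITH THE PATH TO THE SCRAPED CSV
data <- read.csv(csv_path, colClasses = "character")
ratings <- as.numeric(data$RATINGS)
#PLOT THE DISTRIBUTION OF RATINGS
print(histogram(ratings))
#CALCULATE INTERVALS OF WIDTH 0.3 AND COUNT THE MOVIES IN EACH
span <- seq(min(ratings), max(ratings) + 0.1, by = 0.3)
#include.lowest = TRUE KEEPS THE MINIMUM RATING, WHICH WOULD OTHERWISE BECOME NA
span.cut <- cut(ratings, span, right = TRUE, include.lowest = TRUE)
span.freq <- table(span.cut)
print(span.freq)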
Movie Rating Range    X = Contribution to Skillscore
[9.1, 9.5]            6.0
[8.9, 9.0]            5.3
[8.7, 8.8]            4.5
[8.3, 8.6]            3.5
[7.9, 8.2]            3.0
[7.5, 7.8]            2.5
[7.0, 7.4]            2.0
How is Skillscore Calculated?
Skillscore is the sum of the contributions corresponding to the rating of each of a node’s movies (aka X in the table
above). So if I appear in three movies with ratings 7.1, 9.2 and 8.2, my skillscore would be 2.0 + 6.0 + 3.0 = 11.0.
Simple. More could be done with it but for now, this is it.
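The rubric lookup can be written out as a small sketch. The ranges and contributions are copied from the table; the function names are hypothetical, and ratings are assumed to have one decimal place, as on IMDB.

```python
# (low, high, X) rows copied from the rubric table.
RUBRIC = [
    (9.1, 9.5, 6.0),
    (8.9, 9.0, 5.3),
    (8.7, 8.8, 4.5),
    (8.3, 8.6, 3.5),
    (7.9, 8.2, 3.0),
    (7.5, 7.8, 2.5),
    (7.0, 7.4, 2.0),
]


def contribution(rating):
    """Return X, the contribution of one movie rating to a skillscore."""
    for low, high, x in RUBRIC:
        if low <= rating <= high:
            return x
    raise ValueError(f"rating {rating} is outside the rubric")


def skillscore(ratings):
    """Sum the contributions of all the movies a node appears in."""
    return sum(contribution(r) for r in ratings)


# The worked example from the slide: 2.0 + 6.0 + 3.0
print(skillscore([7.1, 9.2, 8.2]))  # 11.0
```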
5. III – Skillscore and Edge-weights (b)
Edge-weight:
Option 1: Focus on the quality of the partnership between the nodes. Here the edge weight is calculated on the
basis of rating(s) of the movie(s) in which the two nodes appear together. For example, if node A and B occur
together in two movies of ratings 8.3 and 9.0, the edge-weight will be calculated using the same Skillscore rubric.
Formula is
Edge Weight = 1 + P*log(Q*skillscore)
Skillscore = sum of all X’s (using the rubric from the last slide) corresponding to each
movie’s rating (in which the two nodes appear together)
P, Q are constants which can be varied to optimize the width of edges in Gephi
Therefore, for our example
Edge Weight = 1 + P*log(Q*(3.5+5.3))
This is a straightforward way to figure out what sort of partnerships (between which nodes) work in the movie
business.
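A minimal sketch of Option 1 follows. It assumes the natural logarithm (the slide does not say which base is used) and uses P = Q = 1 as placeholder constants; the function names are hypothetical.

```python
import math

# (low, high, X) rows copied from the Skillscore rubric.
RUBRIC = [
    (9.1, 9.5, 6.0), (8.9, 9.0, 5.3), (8.7, 8.8, 4.5), (8.3, 8.6, 3.5),
    (7.9, 8.2, 3.0), (7.5, 7.8, 2.5), (7.0, 7.4, 2.0),
]


def contribution(rating):
    """Return X for one movie rating, using the rubric."""
    for low, high, x in RUBRIC:
        if low <= rating <= high:
            return x
    raise ValueError(f"rating {rating} is outside the rubric")


def partnership_weight(shared_ratings, p=1.0, q=1.0):
    """Edge Weight = 1 + P*log(Q*skillscore) over the movies two nodes share.

    Assumption: log is the natural logarithm; p and q are tuning constants.
    """
    shared_score = sum(contribution(r) for r in shared_ratings)
    return 1 + p * math.log(q * shared_score)


# Nodes A and B appear together in movies rated 8.3 and 9.0 (X = 3.5 and 5.3).
weight = partnership_weight([8.3, 9.0])
print(round(weight, 3))
```

Because of the logarithm, each additional shared movie widens the edge less than the one before it, which keeps edge widths in Gephi within a drawable range.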
Option 2: Focus on the skills of the nodes. Here the edge weight is calculated on the basis of skillscores of each
individual node connected by an edge. For any nodes A and B, formula is:
Edge Weight = 1 + P*log(Q*skillscore(A)*skillscore(B))
Skillscore(i) = sum of all X’s (using the rubric from the last slide) corresponding to each
movie’s rating (in which the node i appears)
P, Q are constants which can be varied to optimize the width of edges in Gephi
This makes sense when we are calculating weighted degrees and want to figure out if a node is connected to
many averagely skilled nodes or few highly skilled nodes.
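Option 2 and the weighted degree it feeds into can be sketched as below. Again the natural logarithm and P = Q = 1 are assumptions, the function names are hypothetical, and the skillscores are made-up illustrative numbers rather than values from the data.

```python
import math


def skill_weight(skill_a, skill_b, p=1.0, q=1.0):
    """Edge Weight = 1 + P*log(Q*skillscore(A)*skillscore(B)).

    Assumption: log is the natural logarithm; p and q are tuning constants.
    """
    return 1 + p * math.log(q * skill_a * skill_b)


def weighted_degree(own_skill, neighbour_skills, p=1.0, q=1.0):
    """Sum of the Option 2 edge weights over all of a node's neighbours."""
    return sum(skill_weight(own_skill, s, p, q) for s in neighbour_skills)


# Illustrative numbers only: the same node (skillscore 10) connected to
# a few highly skilled neighbours versus many averagely skilled ones.
few_strong = weighted_degree(10.0, [20.0, 20.0])
many_average = weighted_degree(10.0, [4.0, 4.0, 4.0, 4.0])
print(round(few_strong, 2), round(many_average, 2))
```

With these particular numbers the node with four average neighbours ends up with the larger weighted degree; changing P and Q shifts how strongly neighbour skill counts against neighbour count.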
6. IV – What does the network look like?
Quantitative Details:
• Number of Nodes: 2159
• Number of Edges: 5813 (pretty sparse graph, evidently)
• Number of Connected Components: 80
• Size of Giant Component: 1782 nodes (this roughly corresponds to Hollywood)
• Number of communities detected (Modularity Resolution = 2.0) ~ 90
• Average Degree: 5.385
• Minimum Degree: 3 (because each movie has at least one director and 3
actors, each movie is a clique on the graph in which every node’s
degree is at least 3)
• Maximum Degree: 53 (who could this be? Any guesses?)
• Clustering Coefficient: 0.775 (4297 triangles. This is understandable: each
movie clique contains at least 4 nodes, so there are at least 4 triangles in
each clique; with 1000 cliques, that is roughly 4000 triangles at least)
• Network Diameter = 11
• Avg. Path Length = 4.782
We’ll talk more Mathematics when we talk about edge weights later.
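The figures above can be cross-checked with a few lines of arithmetic. The node, edge and movie counts are the ones reported on this slide; the density calculation is an added illustration of "pretty sparse", and the triangle estimate assumes that no two movies share a whole triangle of people.

```python
from math import comb

nodes = 2159            # reported number of nodes
edges = 5813            # reported number of edges
movies = 1000           # size of the IMDB Top 1000 list
people_per_movie = 4    # at least 1 director + 3 main stars

# In an undirected graph every edge adds 1 to the degree of two nodes.
average_degree = 2 * edges / nodes

# Fraction of all possible node pairs that are actually connected.
density = 2 * edges / (nodes * (nodes - 1))

# A clique of 4 nodes contains C(4, 3) triangles.
triangles_per_clique = comb(people_per_movie, 3)
estimated_triangles = movies * triangles_per_clique

print(round(average_degree, 3))   # 5.385
print(round(density, 4))
print(triangles_per_clique, estimated_triangles)  # 4 4000
```

The reported 4297 triangles sits a little above the 4000 estimate, which fits with some movies listing more than one director and with triangles that span several movies.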
What is this image on the left?
This is the network, visualized using the Radial Axis layout in Gephi. Here
the nodes have been arranged in a circle with each spar of the circle
belonging to a modularity class. The long spars are the communities with the most
members and the shadow beneath each spar is a dense network of edges.
We’ll talk more about what modularity classes signify in the coming slides.
As of now, we can see that the network is a collection of 10-14 big
communities and many small communities.
7. V – Observations and Inferences
Nodes (type director) of
one of the Communities
identified (one of the 90
modularity classes)
Running the Modularity calculations in
Gephi, these directors were grouped
(by the algorithm, not me) together as
members of the same community
(there were other actor members too,
but for the sake of understanding, we
are taking up only directors). This
community makes a lot of sense as
these directors do, in fact, belong to
the community of outstanding
directors who were very active in the
first half, and the early second half of
the 20th century. This is all my
substantial but limited knowledge of
movies has helped me understand. I
am sure a movie expert would be able
to see more similarities.
8. V – Observations and Inferences
Nodes (type director) of
one of the Communities
identified (one of the 90
modularity classes)
This is also an interesting community
detected because most of the
directors in this community are
patrons of world cinema and in fact
belong to the European continent.
Juan Antonio Bayona (a fairly young
director) has even cited the much-
acclaimed Guillermo del Toro as his
inspiration and these two have been
grouped together without having
worked together. That was quite
amazing. Zach Braff is a bit of a
difficult nut to crack here, but he is
also a bit quirky when it comes to
directing. Don’t know for sure. Expert
movie opinion needed.
Note: Brown color means the
personality is both an actor and a
director (not necessarily in the same
movie)
9. V – Observations and Inferences
Nodes (type director) of
one of the Communities
identified (one of the 90
modularity classes)
Surprisingly, this identified community
of directors is similar in the sense that
most of the names in the inner circle
are cited and known for having their
own signature style of directing
movies. The styles may be different
but the attribute of having a signature
style is common. These have been
grouped together probably because
most of these directors tend to have a
fixed set of actors with whom they make
movies and thus the degree attributes
etc. might be the same (on the basis of
which they are grouped together).
But Wes Anderson’s inspiration is
Stanley Kubrick. And both Wes
Anderson and Guy Ritchie make
movies with emphasis on script and
dialogs (as did Stanley Kubrick).
10. V – Observations and Inferences
Nodes (type director) of
one of the Communities
identified (one of the 90
modularity classes)
The Method Directors or the Serious
Directors or The Cult Directors. Well
mostly Method Directors. These are
the directors who are passionate
about the movies they make and it
shows in their movies. They pay high
attention to details and are
comfortable with quite a few genres of
movies.
11. V – Observations and Inferences
Nodes (type director) of
one of the Communities
identified (one of the 90
modularity classes)
When edge weights depended on the
partnerships, the graph threw forward
this community class where Stanley
Kubrick is paired with Alfred
Hitchcock’s community (which wasn’t
the case earlier where edge weights
depended on the nodes and not the
partnerships). This is more accurate
on an objective basis because Stanley
Kubrick was most active in the same
era as Alfred Hitchcock but it can’t be
said for sure if they had the same
styles.
Probably, we can then say that the
partnership-based edge weights are
more accurate for detecting
communities based on objective
criteria such as active period and origin
(since the nodes will have the same types
of neighbours).
12. V – Qualitative Analysis of The Network
Nodes (type director) of
one of the Communities
identified (one of the 90
modularity classes)
Compared to the community we saw
earlier with Juan Antonio Bayona as a
member, this community is more
focused on directors based out of
mainland Europe (or who primarily
work in different vernaculars). We find
Danny Boyle missing from this group
and other European directors joining
in. Another instance where
partnership based edge weights are
more accurate in identifying
communities on an objective basis.
13. V – Observations and Inferences
Tag Cloud Based on Weighted Degrees (Weight by Partnerships)
Understandably, Robert De Niro (with the maximum degree, 53, and 21 appearances), Steven Spielberg (18
appearances) and Clint Eastwood are the highlights. Please take note of Tom Cruise and Hayao Miyazaki (a
Japanese director who is acclaimed for his animated movies), as this will be helpful in the next slide.
14. V – Observations and Inferences
Tag Cloud Based on Betweenness Centrality (Weight by Partnerships)
Tom Cruise is now in the middle and prominent because he has, over time, worked with many
directors/actors and is thus more central. The same goes for Matt Damon. Hayao Miyazaki also appears bigger
because he is more central as a bridge between Hollywood and the Japanese movie industry (even though he didn’t have a
high weighted degree rank).
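Why a bridge node can rank high on betweenness while ranking low on degree can be seen on a toy graph. The sketch below uses Brandes' algorithm on unweighted shortest paths, which is an assumption about the setting (Gephi computed the actual values for this project); the graph and node names are made up, and the function name is hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Unnormalized betweenness centrality of an undirected, unweighted graph.

    adj maps each node to a list of its neighbours (Brandes' algorithm).
    """
    centrality = {v: 0.0 for v in adj}
    for source in adj:
        order = []                          # nodes in order of discovery
        preds = {v: [] for v in adj}        # predecessors on shortest paths
        paths = {v: 0 for v in adj}         # number of shortest paths from source
        dist = {v: -1 for v in adj}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:                        # breadth-first search
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        dependency = {v: 0.0 for v in adj}
        for w in reversed(order):           # accumulate from the farthest nodes
            for v in preds[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Each unordered pair was counted from both of its endpoints.
    return {v: c / 2 for v, c in centrality.items()}


# Two tightly knit triangles joined only through the bridge node X.
adj = {
    "A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "X"],
    "X": ["C", "D"],
    "D": ["X", "E", "F"], "E": ["D", "F"], "F": ["D", "E"],
}
scores = betweenness(adj)
print(scores["X"], scores["C"], scores["A"])  # 9.0 8.0 0.0
```

X has only two neighbours, yet every shortest path between the two triangles runs through it, so it gets the highest betweenness in the graph.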
15. V – Observations and Inferences
Statistically the Most Important Partnerships in the World Movie Industry
No node sits on the smooth part of an edge. This is important to specify because otherwise Orson Welles appears to be
a bridge between Akira Kurosawa and Toshiro Mifune (which is not true). Each node is either at the end of a
connection or at an inflection point in the edge curve. Also, this is purely statistical: from an artistic perspective,
Daniel Radcliffe, Emma Watson and Rupert Grint are not exactly the most important partnership.
16. V – Shortcomings of the Data and the Analysis
• The data collected is purely indicative. A better study would include all the prominent actors of a movie
and not just the 3 main actors. The main actors might not even be the main actors, as they are listed based on
popular votes rather than their roles in the story.
• A better formula to calculate skillscore can be thought of (definitely) as the skillscore rubric used here might
be too naïve and simple.
• Data is only indicative of the quality of actors. More data about the movies (like how many votes, weighted
votes (IMDB uses weighted votes) etc) can make the network more granular.
• Constraints of Subjective Analysis: While working with any form of art/literature, it is highly uninteresting to
work only with the Mathematical Data without delving into the qualitative nature of the analysis. But on the
other hand, it is as difficult to have a solid understanding of the qualitative nature of the data because it is
highly dependent on the perceiver. Thus, most of the subjective analysis done in this project might or might
not be true depending on how informed an analyst is when it comes to movies.
• I found myself susceptible to hindsight bias and confirmation bias while analyzing this network. At one point, I
even thought I had apophenia (seeing meaningful patterns when none exist). Thus I am not too sure of the
conclusions drawn (also because I don’t have that detailed a knowledge of movies, directors and
actors) and they should be taken with a pinch of salt. They might be wrong, but as of now, since no one has
burst the bubble, I will take them to be true by intuition.