Social network analysis with Hadoop

                                        Jake Hofman

                                        Yahoo! Research


                                      October 2, 2009




Jake Hofman   (Yahoo! Research)   Social network analysis with Hadoop   October 2, 2009
Social networks

  • Rapid increase in amount and variety of social network data




  • Valuable information for products (recommendations, advertising,
     etc.) and research (structure/dynamics, diffusion, etc.)






  Goal: to enable analysis of large-scale social network data with readily
                       available software/hardware

1970s ∼ 10^1 nodes

  [Figure: "Social Network Model of Relationships in the Karate Club" (Figure 1 from the Journal of Anthropological Research), a graph of the relationships among the 34 individuals in the club]

  • Few direct observations; highly detailed info on nodes and edges
  • E.g. karate club (Zachary, 1977)

1990s ∼ 10^4 nodes




  • Larger, indirect samples; relatively few details on nodes and edges
  • E.g. APS co-authorship network (http://bit.ly/aps08jmh)

Present ∼ 10^8+ nodes




  • Very large, dynamic samples; many details in node and edge metadata
  • E.g. Mail, Messenger, Facebook, Twitter, etc.



Scale

 • Example numbers:
     • ∼ 10^7 nodes
     • ∼ 10^2 edges/node (degree)
     • no node/edge data
     • static
     • ∼8GB

    [Diagram: an edge between User 1 and User 2]

    Simple, static networks push memory limit for commodity machines
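
The ∼8GB figure is consistent with a quick count of edges. The sketch below assumes roughly 8 bytes per stored edge (for example two 4-byte integer node ids); that per-edge size is an assumption made here for illustration, not a number given on the slide.

```python
# Back-of-the-envelope size of a static edge list.
# bytes_per_edge = 8 is an assumption (two 4-byte integer node ids),
# not a figure stated in the talk.
nodes = 10**7            # ~10^7 nodes
edges_per_node = 10**2   # ~10^2 edges/node (degree)
bytes_per_edge = 8       # assumed

total_edges = nodes * edges_per_node
total_bytes = total_edges * bytes_per_edge
print(f"{total_edges:.0e} edges, about {total_bytes / 1e9:.0f} GB")
```

Any node or edge metadata would add to this estimate.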




Scale

 • Example numbers:
     • ∼ 10^7 nodes
     • ∼ 10^2 edges/node (degree)
     • node/edge metadata
     • dynamic
     • ∼100GB/day

    [Diagram: User 1 and User 2, each with profile and history, linked by messages with header and content]

     Dynamic, data-rich social networks exceed memory limits; require
                           considerable storage




Distributed network analysis




 MapReduce convenient for
 parallelizing individual
 node/edge-level calculations
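
As an illustration of a node-level calculation, here is a minimal sketch of an out-degree count written in the map/reduce style. The function names (`degree_mapper`, `degree_reducer`, `run_local`) and the tab-separated `src<TAB>dst` edge format are assumptions for this example, not the package's actual code. Under Hadoop Streaming the mapper and reducer would instead read lines from standard input and print tab-separated key/value lines.

```python
from itertools import groupby


def degree_mapper(lines):
    """Map step: for each edge 'src<TAB>dst', emit (src, 1)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        src, _dst = line.split("\t")
        yield src, 1


def degree_reducer(pairs):
    """Reduce step: sum the counts per node; input must be sorted by key."""
    for node, group in groupby(pairs, key=lambda kv: kv[0]):
        yield node, sum(count for _, count in group)


def run_local(lines):
    """Local stand-in for the sort that happens between map and reduce."""
    return dict(degree_reducer(sorted(degree_mapper(lines))))


# Hypothetical toy edge list: a -> b, a -> c, b -> c
edges = ["a\tb", "a\tc", "b\tc"]
out_degree = run_local(edges)
print(out_degree)
```

Each edge is processed independently in the map step, which is what makes this kind of calculation easy to parallelize.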








 Higher-order calculations more
 difficult when network exceeds
 memory constraints, but can be
 adapted to MapReduce
 framework
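
One common way to adapt a global calculation is to iterate simple rounds until nothing changes. The sketch below finds connected components by repeatedly passing the smallest node label to neighbours. It is an illustrative approach under stated assumptions (an undirected graph held as a symmetric adjacency dict, with the hypothetical names `propagate_round` and `connected_components`), not necessarily the algorithm used in the package, and it runs in memory rather than on a cluster.

```python
from collections import defaultdict


def propagate_round(adjacency, labels):
    """One map/reduce-style round of label passing."""
    inbox = defaultdict(list)
    # "Map": every node sends its current label to itself and its neighbours.
    for node, neighbours in adjacency.items():
        inbox[node].append(labels[node])
        for nb in neighbours:
            inbox[nb].append(labels[node])
    # "Reduce": every node keeps the smallest label it received.
    return {node: min(received) for node, received in inbox.items()}


def connected_components(adjacency):
    """Repeat rounds until labels stop changing; equal label = same component."""
    labels = {node: node for node in adjacency}
    while True:
        new_labels = propagate_round(adjacency, labels)
        if new_labels == labels:
            return labels
        labels = new_labels


# Hypothetical toy graph with two components: {1, 2, 3} and {4, 5}
adjacency = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
components = connected_components(adjacency)
print(components)
```

The number of rounds grows with the diameter of the largest component, so each round would be a separate MapReduce job in a distributed setting.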




Package details

     • Network creation/manipulation
           • Logs → edges
           • Edge list ↔ adjacency list
           • Directed ↔ undirected
           • Edge thresholds
     • First-order descriptive statistics
           • Number of nodes
           • Number of edges
           • Node degrees
     • Higher-order node-level descriptive statistics
           • Clustering coefficient
           • Implicit degree
           • ...
     • Global calculations
           • Pairwise connectivity
           • Connected components
           • Minimum spanning tree
           • Breadth-first search
           • Pagerank
           • Community detection

                     Currently implemented in Streaming with Python
                     Algorithms exist/developed for additional features
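
To make the creation/manipulation items concrete, here is a small in-memory sketch of the directed ↔ undirected and edge list ↔ adjacency list conversions. The function names are hypothetical and the choice to drop self-loops is an assumption of this example; the package's own interfaces are not shown in the slides.

```python
from collections import defaultdict


def to_undirected(edges):
    """Directed -> undirected: keep one copy of each edge, endpoints ordered."""
    return sorted({tuple(sorted(edge)) for edge in edges if edge[0] != edge[1]})


def to_adjacency(edges):
    """Edge list -> adjacency list, storing both directions of each edge."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return {node: sorted(nbrs) for node, nbrs in adjacency.items()}


def to_edge_list(adjacency):
    """Adjacency list -> edge list, emitting each undirected edge once."""
    return sorted((u, v) for u, nbrs in adjacency.items() for v in nbrs if u < v)


# Hypothetical toy input: a <-> b is reciprocated, b -> c is not.
directed = [("a", "b"), ("b", "a"), ("b", "c")]
undirected = to_undirected(directed)
adjacency = to_adjacency(undirected)
print(undirected)
print(adjacency)
```

Converting back with `to_edge_list(adjacency)` returns the same undirected edge list, so the two representations carry the same information.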


Application: Twitter




  • Distributed crawl of Twitter social network + public messages
     (crawler by Eytan Bakshy, http://bit.ly/eytanb)
  • ∼ 25 million nodes, ∼ 800 million edges
Twitter: Degree Distribution
  [Figure: log-log plot of count (10^0 to 10^8) versus degree (10^0 to 10^6); series: out-degree (friends), in-degree (followers)]


  • Aggregates users by number of friends/followers seen in crawl
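
The aggregation described above can be sketched in two counting steps: first count each user's degree, then count how many users share each degree. The edge orientation `(follower, followed)` and the function name `degree_distribution` are assumptions for this example.

```python
from collections import Counter


def degree_distribution(edges):
    """Return (out_hist, in_hist), each mapping degree -> number of users.

    Users with degree zero never appear as a key in the first-stage counters,
    so they are not included in the histograms.
    """
    out_degree = Counter(src for src, _ in edges)   # friends per user
    in_degree = Counter(dst for _, dst in edges)    # followers per user
    return Counter(out_degree.values()), Counter(in_degree.values())


# Hypothetical toy crawl, as (follower, followed) pairs
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
out_hist, in_hist = degree_distribution(edges)
print(dict(out_hist), dict(in_hist))
```

Plotting these histograms on logarithmic axes gives a degree distribution plot of the kind shown in the figure.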

              Many people not followed by anyone; few followed by many
                     Most people follow at least a few others
Twitter: Node-level clustering coefficient

  • Fraction of edges amongst a node’s friends/followers (Watts &
     Strogatz, 1998)
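
For a single node this fraction can be computed directly. The sketch below uses the undirected form of the coefficient, C = 2e / (k(k - 1)), where k is the number of neighbours and e the number of edges among them. Treating the graph as undirected, returning 0.0 for nodes with fewer than two neighbours, and the name `clustering_coefficient` are all assumptions of this example; the talk's Twitter analysis separates friends from followers.

```python
from itertools import combinations


def clustering_coefficient(adjacency, node):
    """Fraction of pairs of `node`'s neighbours that are themselves connected."""
    neighbours = adjacency[node] - {node}
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention chosen here; the ratio is undefined for k < 2
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return 2.0 * links / (k * (k - 1))


# Hypothetical toy graph: triangle a-b-c plus a pendant node d attached to a
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
cc_a = clustering_coefficient(adjacency, "a")
cc_b = clustering_coefficient(adjacency, "b")
print(cc_a, cc_b)
```

Node a has three neighbours and one edge among them, giving 1/3; node b's two neighbours are connected, giving 1.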

  [Figure: histogram of count (log scale, 10^0 to 10^8) versus clustering coefficient (0 to 0.8); series: followers, friends]


                Surprisingly high density at 0.5 (many isolated triangles)

Future plans



  • Open-source release
  • “A Model of Computation for MapReduce”, Karloff, Suri, &
     Vassilvitskii, Symposium on Discrete Algorithms, 2010 (Accepted)
  • Twitter analysis publication (In progress)



  Goal: to enable analysis of large-scale social network data with readily
                       available software/hardware




Collaborators


    • Eytan Bakshy (m, y)
    • Sharad Goel (y)
    • Winter Mason (y)
    • Sid Suri (y)
    • Sergei Vassilvitskii (y)
    • Duncan Watts (y)
    • (You?)


(y) Yahoo! Research (http://research.yahoo.com)
(m) University of Michigan



Thanks.




                                         Questions?

   hofman@yahoo-inc.com, jakehofman.com
Jake Hofman   (Yahoo! Research)   Social network analysis with Hadoop   October 2, 2009

Contenu connexe

En vedette

2013 NodeXL Social Media Network Analysis
2013 NodeXL Social Media Network Analysis2013 NodeXL Social Media Network Analysis
2013 NodeXL Social Media Network AnalysisMarc Smith
 
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...rhatr
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Jonathan Seidman
 
Complex and Social Network Analysis in Python
Complex and Social Network Analysis in PythonComplex and Social Network Analysis in Python
Complex and Social Network Analysis in Pythonrik0
 
Chapter 8 Diffusion Networks
Chapter 8   Diffusion NetworksChapter 8   Diffusion Networks
Chapter 8 Diffusion NetworksMardy McGaw
 
Resume of Vimal 4.1
Resume of Vimal 4.1Resume of Vimal 4.1
Resume of Vimal 4.1Vimal Suthar
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and SparkJongwook Woo
 
BIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaBIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaSkillspeed
 
Traffic data analysis using HADOOP
Traffic data analysis using HADOOPTraffic data analysis using HADOOP
Traffic data analysis using HADOOPKirthan S Holla
 
Basic Sentiment Analysis using Hive
Basic Sentiment Analysis using HiveBasic Sentiment Analysis using Hive
Basic Sentiment Analysis using HiveQubole
 
Hadoop - Stock Analysis
Hadoop - Stock AnalysisHadoop - Stock Analysis
Hadoop - Stock AnalysisVaibhav Jain
 
TRAFFIC DATA ANALYSIS USING HADOOP
TRAFFIC DATA ANALYSIS USING HADOOPTRAFFIC DATA ANALYSIS USING HADOOP
TRAFFIC DATA ANALYSIS USING HADOOPKirthan S Holla
 
PCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningPCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningDr. Mirko Kämpf
 
Best Practices for Hadoop Data Analysis with Tableau and Hortonworks Data Pla...
Best Practices for Hadoop Data Analysis with Tableau and Hortonworks Data Pla...Best Practices for Hadoop Data Analysis with Tableau and Hortonworks Data Pla...
Best Practices for Hadoop Data Analysis with Tableau and Hortonworks Data Pla...Hortonworks
 
An Introduction to Graph Databases
An Introduction to Graph DatabasesAn Introduction to Graph Databases
An Introduction to Graph DatabasesInfiniteGraph
 

En vedette (17)

2013 NodeXL Social Media Network Analysis
2013 NodeXL Social Media Network Analysis2013 NodeXL Social Media Network Analysis
2013 NodeXL Social Media Network Analysis
 
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
 
Diffusion of Innovations Overview
Diffusion of Innovations OverviewDiffusion of Innovations Overview
Diffusion of Innovations Overview
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011
 
Complex and Social Network Analysis in Python
Complex and Social Network Analysis in PythonComplex and Social Network Analysis in Python
Complex and Social Network Analysis in Python
 
Chapter 8 Diffusion Networks
Chapter 8   Diffusion NetworksChapter 8   Diffusion Networks
Chapter 8 Diffusion Networks
 
Resume of Vimal 4.1
Resume of Vimal 4.1Resume of Vimal 4.1
Resume of Vimal 4.1
 
Hadoop data analysis
Hadoop data analysisHadoop data analysis
Hadoop data analysis
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
 
BIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaBIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social Media
 
Traffic data analysis using HADOOP
Traffic data analysis using HADOOPTraffic data analysis using HADOOP
Traffic data analysis using HADOOP
 
Basic Sentiment Analysis using Hive
Basic Sentiment Analysis using HiveBasic Sentiment Analysis using Hive
Basic Sentiment Analysis using Hive
 
Hadoop - Stock Analysis
Hadoop - Stock AnalysisHadoop - Stock Analysis
Hadoop - Stock Analysis
 
TRAFFIC DATA ANALYSIS USING HADOOP
TRAFFIC DATA ANALYSIS USING HADOOPTRAFFIC DATA ANALYSIS USING HADOOP
TRAFFIC DATA ANALYSIS USING HADOOP
 
PCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningPCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System Tuning
 
Best Practices for Hadoop Data Analysis with Tableau and Hortonworks Data Pla...
Best Practices for Hadoop Data Analysis with Tableau and Hortonworks Data Pla...Best Practices for Hadoop Data Analysis with Tableau and Hortonworks Data Pla...
Best Practices for Hadoop Data Analysis with Tableau and Hortonworks Data Pla...
 
An Introduction to Graph Databases
An Introduction to Graph DatabasesAn Introduction to Graph Databases
An Introduction to Graph Databases
 

Similaire à HW09 Social network analysis with Hadoop

On relational sociology
On relational sociologyOn relational sociology
On relational sociologyNaoki Maejima
 
The International Journal of Engineering and Science (IJES)
The International Journal of Engineering and Science (IJES)The International Journal of Engineering and Science (IJES)
The International Journal of Engineering and Science (IJES)theijes
 
Social Relation Based Scalable Semantic Search Refinement
Social Relation Based Scalable Semantic Search RefinementSocial Relation Based Scalable Semantic Search Refinement
Social Relation Based Scalable Semantic Search RefinementYi Zeng
 
Sylva workshop.gt that camp.2012
Sylva workshop.gt that camp.2012Sylva workshop.gt that camp.2012
Sylva workshop.gt that camp.2012CameliaN
 
Group and Community Detection in Social Networks
Group and Community Detection in Social NetworksGroup and Community Detection in Social Networks
Group and Community Detection in Social NetworksKent State University
 
Networkx & Gephi Tutorial #Pydata NYC
Networkx & Gephi Tutorial #Pydata NYCNetworkx & Gephi Tutorial #Pydata NYC
Networkx & Gephi Tutorial #Pydata NYCGilad Lotan
 
Vinci2011会议演讲PPT
Vinci2011会议演讲PPTVinci2011会议演讲PPT
Vinci2011会议演讲PPTdasiyjun
 
20111123 mwa2011-marc smith
20111123 mwa2011-marc smith20111123 mwa2011-marc smith
20111123 mwa2011-marc smithMarc Smith
 
Digital Research and Big Data: Is the Tail Wagging the Dog?
Digital Research and Big Data: Is the Tail Wagging the Dog?Digital Research and Big Data: Is the Tail Wagging the Dog?
Digital Research and Big Data: Is the Tail Wagging the Dog?Eric Meyer
 
20121001 pawcon 2012-marc smith - mapping collections of connections in socia...
20121001 pawcon 2012-marc smith - mapping collections of connections in socia...20121001 pawcon 2012-marc smith - mapping collections of connections in socia...
20121001 pawcon 2012-marc smith - mapping collections of connections in socia...Marc Smith
 
Social Networks and Computer Science
Social Networks and Computer ScienceSocial Networks and Computer Science
Social Networks and Computer Sciencedragonmeteor
 
Networks in their surrounding contexts
Networks in their surrounding contextsNetworks in their surrounding contexts
Networks in their surrounding contextsVamshi Vangapally
 

Similaire à HW09 Social network analysis with Hadoop (14)

On relational sociology
On relational sociologyOn relational sociology
On relational sociology
 
The International Journal of Engineering and Science (IJES)
The International Journal of Engineering and Science (IJES)The International Journal of Engineering and Science (IJES)
The International Journal of Engineering and Science (IJES)
 
Social Relation Based Scalable Semantic Search Refinement
Social Relation Based Scalable Semantic Search RefinementSocial Relation Based Scalable Semantic Search Refinement
Social Relation Based Scalable Semantic Search Refinement
 
Sylva workshop.gt that camp.2012
Sylva workshop.gt that camp.2012Sylva workshop.gt that camp.2012
Sylva workshop.gt that camp.2012
 
Group and Community Detection in Social Networks
Group and Community Detection in Social NetworksGroup and Community Detection in Social Networks
Group and Community Detection in Social Networks
 
Networkx & Gephi Tutorial #Pydata NYC
Networkx & Gephi Tutorial #Pydata NYCNetworkx & Gephi Tutorial #Pydata NYC
Networkx & Gephi Tutorial #Pydata NYC
 
Vinci2011会议演讲PPT
Vinci2011会议演讲PPTVinci2011会议演讲PPT
Vinci2011会议演讲PPT
 
20111123 mwa2011-marc smith
20111123 mwa2011-marc smith20111123 mwa2011-marc smith
20111123 mwa2011-marc smith
 
Network Theory
Network TheoryNetwork Theory
Network Theory
 
Digital Research and Big Data: Is the Tail Wagging the Dog?
Digital Research and Big Data: Is the Tail Wagging the Dog?Digital Research and Big Data: Is the Tail Wagging the Dog?
Digital Research and Big Data: Is the Tail Wagging the Dog?
 
20121001 pawcon 2012-marc smith - mapping collections of connections in socia...
20121001 pawcon 2012-marc smith - mapping collections of connections in socia...20121001 pawcon 2012-marc smith - mapping collections of connections in socia...
20121001 pawcon 2012-marc smith - mapping collections of connections in socia...
 
Social Networks and Computer Science
Social Networks and Computer ScienceSocial Networks and Computer Science
Social Networks and Computer Science
 
Networks in their surrounding contexts
Networks in their surrounding contextsNetworks in their surrounding contexts
Networks in their surrounding contexts
 
Simple SNA.pdf
Simple SNA.pdfSimple SNA.pdf
Simple SNA.pdf
 

HW09 Social network analysis with Hadoop

  • 1. Social network analysis with Hadoop Jake Hofman Yahoo! Research October 2, 2009 Jake Hofman (Yahoo! Research) Social network analysis with Hadoop October 2, 2009
  • 2. Social networks • Rapid increase in amount and variety of social network data • Valuable information for products (recommendations, advertising, etc.) and research (structure/dynamics, diffusion, etc.)
  • 3. Social networks Goal: to enable analysis of large-scale social network data with readily available software/hardware
  • 4. 1970s ∼ 10^1 nodes • Few direct observations; highly detailed info on nodes and edges • E.g. karate club (Zachary, 1977) [Figure: "Social Network Model of Relationships in the Karate Club", Figure 1 from Zachary's paper in the Journal of Anthropological Research, shown over an excerpt of the paper's text]
  • 5. 1990s ∼ 10^4 nodes • Larger, indirect samples; relatively few details on nodes and edges • E.g. APS co-authorship network (http://bit.ly/aps08jmh)
  • 6. Present ∼ 10^8+ nodes • Very large, dynamic samples; many details in node and edge metadata • E.g. Mail, Messenger, Facebook, Twitter, etc.
  • 7–8. Scale • Example numbers: ∼ 10^7 nodes; ∼ 10^2 edges/node (degree); no node/edge data; static; ∼8GB [Diagram: User 1 linked to User 2] • Simple, static networks push the memory limit for commodity machines
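The ∼8GB figure can be reproduced with a quick estimate. The sketch below assumes each edge is stored as a pair of 4-byte integer node IDs; that storage layout is an assumption made for illustration, not something the slides specify.

```python
# Back-of-the-envelope size of a static edge list with no node/edge data.
# ASSUMPTION (not stated in the slides): one edge = two 4-byte integer node IDs.
NODES = 10**7
EDGES_PER_NODE = 10**2
BYTES_PER_EDGE = 2 * 4


def edge_list_bytes(nodes, edges_per_node, bytes_per_edge):
    """Total bytes needed to hold every edge once."""
    return nodes * edges_per_node * bytes_per_edge


total_bytes = edge_list_bytes(NODES, EDGES_PER_NODE, BYTES_PER_EDGE)
print(f"{total_bytes / 10**9:.0f} GB")
```

Under that assumption the edge list alone comes to 10^9 edges at 8 bytes each, which is consistent with the slide's estimate.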
  • 9–10. Scale • Example numbers: ∼ 10^7 nodes; ∼ 10^2 edges/node (degree); node/edge metadata; dynamic; ∼100GB/day [Diagram: User 1 and User 2, each with a profile and history, linked by a message with header and content] • Dynamic, data-rich social networks exceed memory limits and require considerable storage
  • 11. Distributed network analysis • MapReduce is convenient for parallelizing individual node/edge-level calculations
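As an illustration of a node-level calculation that fits this pattern, here is a minimal sketch of node degree as a mapper and reducer in the Hadoop Streaming style. The function names, the whitespace-separated edge format, and the local `run_local` driver (which stands in for Hadoop's sort-and-shuffle step) are assumptions for this example and are not taken from the package described in the talk.

```python
from itertools import groupby


def degree_mapper(lines):
    """Emit (node, 1) for each endpoint of every whitespace-separated edge."""
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip malformed records
        source, target = fields
        yield source, 1
        yield target, 1


def degree_reducer(pairs):
    """Sum counts per node; expects pairs sorted by key, as after a shuffle."""
    for node, group in groupby(pairs, key=lambda pair: pair[0]):
        yield node, sum(count for _, count in group)


def run_local(lines):
    """Hypothetical local stand-in for Hadoop: sort mapper output, then reduce."""
    return dict(degree_reducer(sorted(degree_mapper(lines))))


edges = ["a\tb", "a\tc", "b\tc", "c\td"]
degrees = run_local(edges)
```

In a real Streaming job the mapper and reducer would read from standard input and write tab-separated lines to standard output; the generator form here only keeps the example easy to run on one machine.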
  • 12. Distributed network analysis • Higher-order calculations are more difficult when the network exceeds memory constraints, but can be adapted to the MapReduce framework
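One common way to adapt a global calculation to repeated map/reduce rounds is minimum-label propagation for connected components. The sketch below is a generic illustration of that technique, run in memory for clarity; it is not necessarily the algorithm used in the package, and it assumes a symmetric adjacency list in which every node appears as a key.

```python
from collections import defaultdict


def propagate_once(adjacency, labels):
    """One map/reduce round: nodes send their label to neighbors and keep the minimum."""
    # Map step: each node emits its current label to itself and to every neighbor.
    messages = defaultdict(list)
    for node, neighbors in adjacency.items():
        messages[node].append(labels[node])
        for neighbor in neighbors:
            messages[neighbor].append(labels[node])
    # Reduce step: each node keeps the smallest label it received.
    return {node: min(received) for node, received in messages.items()}


def connected_components(adjacency):
    """Repeat rounds until no label changes; nodes sharing a label share a component."""
    labels = {node: node for node in adjacency}
    while True:
        updated = propagate_once(adjacency, labels)
        if updated == labels:
            return labels
        labels = updated


# ASSUMPTION: undirected graph, so each edge is listed in both directions.
adjacency = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
components = connected_components(adjacency)
```

Each round corresponds to one MapReduce job, so the number of jobs grows with the diameter of the largest component; that cost is one reason such calculations are harder than per-node statistics.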
  • 13–14. Package details • Network creation/manipulation: logs → edges; edge list ↔ adjacency list; directed ↔ undirected; edge thresholds • First-order descriptive statistics: number of nodes; number of edges; node degrees • Higher-order node-level descriptive statistics: clustering coefficient; implicit degree; ... • Global calculations: pairwise connectivity; connected components; minimum spanning tree; breadth-first search; PageRank; community detection • Currently implemented in Streaming with Python • Algorithms exist or have been developed for additional features
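To make the network manipulation items concrete, the following sketch shows edge list ↔ adjacency list conversion and a directed → undirected conversion on small in-memory data. The function names and the choice to drop self-loops when symmetrizing are assumptions for illustration; the package's actual interfaces are not shown in the slides.

```python
from collections import defaultdict


def edges_to_adjacency(edges):
    """Group directed (source, target) pairs into a source -> sorted targets mapping."""
    adjacency = defaultdict(list)
    for source, target in edges:
        adjacency[source].append(target)
    return {node: sorted(targets) for node, targets in adjacency.items()}


def adjacency_to_edges(adjacency):
    """Flatten an adjacency mapping back into a sorted list of directed edges."""
    return [(source, target)
            for source in sorted(adjacency)
            for target in adjacency[source]]


def to_undirected(edges):
    """Collapse reciprocal directed edges into one undirected edge each.

    ASSUMPTION: self-loops are dropped, which the slides do not specify.
    """
    return sorted({tuple(sorted(edge)) for edge in edges if edge[0] != edge[1]})


directed = [("a", "b"), ("b", "a"), ("a", "c")]
adjacency = edges_to_adjacency(directed)
round_trip = adjacency_to_edges(adjacency)
undirected = to_undirected(directed)
```

In a MapReduce setting the grouping in `edges_to_adjacency` is what the shuffle provides for free: the mapper emits each edge keyed by its source, and the reducer receives all targets for that source together.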
  • 15–16. Application: Twitter • Distributed crawl of Twitter social network + public messages (crawler by Eytan Bakshy, http://bit.ly/eytanb) • ∼ 25 million nodes, ∼ 800 million edges
  • 17–18. Twitter: Degree Distribution [Figure: log-log plot of count (10^0 to 10^8) against degree (10^0 to 10^6), with separate series for out-degree (friends) and in-degree (followers)] • Aggregates users by number of friends/followers seen in crawl • Many people not followed by anyone; few followed by many • Most people follow at least a few others
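The aggregation behind such a plot can be sketched as two counting passes: first count each user's friends and followers, then count how many users share each degree value. The edge orientation (follower, followed) and the function name below are assumptions for this example; the data is a toy edge list, not the crawl.

```python
from collections import Counter


def degree_distribution(edges):
    """Return (out-degree histogram, in-degree histogram) as degree -> user count.

    ASSUMPTION: each edge is (follower, followed), so out-degree counts
    friends and in-degree counts followers.
    """
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    nodes = set(out_degree) | set(in_degree)
    # Counter returns 0 for users that never appear on one side of an edge.
    out_hist = Counter(out_degree[node] for node in nodes)
    in_hist = Counter(in_degree[node] for node in nodes)
    return dict(out_hist), dict(in_hist)


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
out_hist, in_hist = degree_distribution(edges)
```

Even in this toy example the in-degree histogram shows the pattern noted above: two of four users have no followers, while one user is followed by three others.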
  • 19–21. Twitter: Node-level clustering coefficient • Fraction of edges amongst a node’s friends/followers (Watts & Strogatz, 1998) [Figure: count (log scale, 10^0 to 10^8) against clustering coefficient (0 to 0.8), with separate series for followers and friends] • Surprisingly high density at 0.5 (many isolated triangles)
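For reference, a minimal in-memory sketch of the local clustering coefficient follows. It uses the undirected simplification, where the coefficient is the number of links among a node's neighbors divided by the number of possible neighbor pairs; the talk computes it separately over friends and followers in a directed network, which this sketch does not reproduce. Returning 0.0 for nodes with fewer than two neighbors is a convention assumed here.

```python
from itertools import combinations


def local_clustering(adjacency, node):
    """Fraction of possible edges among a node's neighbors that are present.

    ASSUMPTIONS: undirected graph stored as node -> set of neighbors;
    nodes with fewer than two neighbors get 0.0 by convention.
    """
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return links / (k * (k - 1) / 2)


# A triangle a-b-c with one extra node d attached to a.
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
coefficients = {node: local_clustering(adjacency, node) for node in adjacency}
```

This calculation is "higher-order" in the sense used earlier: it needs each neighbor's own adjacency list, so a distributed version has to join neighbor lists across nodes rather than process edges independently.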
  • 22. Future plans • Open-source release • “A Model of Computation for MapReduce”, Karloff, Suri, & Vassilvitskii, Symposium on Discrete Algorithms, 2010 (Accepted) • Twitter analysis publication (In progress) • Goal: to enable analysis of large-scale social network data with readily available software/hardware
  • 23. Collaborators • Eytan Bakshy (University of Michigan; Yahoo! Research) • Sharad Goel (Yahoo! Research) • Winter Mason (Yahoo! Research) • Sid Suri (Yahoo! Research) • Sergei Vassilvitskii (Yahoo! Research) • Duncan Watts (Yahoo! Research) • (You?) • Yahoo! Research: http://research.yahoo.com
  • 24. Thanks. Questions? Contact: hofman@yahoo-inc.com, jakehofman.com