Scalable Topic-Specific Influence Analysis on Microblogs
Yuanyuan Tian
IBM Almaden Research Center
© 2013 IBM Corporation
Motivation 
 Huge amounts of textual and social information are produced by popular microblogging sites.
• Twitter had over 500 million users creating over 340 million tweets 
daily, reported in March 2012. 
 A popular resource for marketing campaigns 
• Monitor opinions of consumers 
• Launch viral advertising 
 Identifying key influencers is crucial for market analysis 
Existing Social Influence Analysis 
 Most existing influence analyses are based only on network structures
• E.g., the influence maximization work by Kleinberg et al.
• Valuable textual content is ignored 
• Only global influencers are identified 
 Topic-specific influence analysis 
• Network + Content 
• Differentiate influence in different aspects of life (topics) 
Existing Topic-Specific Influence Analysis 
 Separate analysis of content and analysis of networks 
• E.g. Topic-sensitive PageRank (TSPR), TwitterRank 
• Text analysis → topics
• For each topic, apply influence analysis on the network structure 
• But in microblog networks, content and links are often related 
 A user tends to follow another who tweets in similar topics 
 Analysis of content and network together 
• E.g. Link-LDA 
• But designed for citation and hyperlink networks 
• Assumption: content is the only cause of links 
• But in microblog networks, a user sometimes follows another for a 
content-independent reason 
 A user follows celebrities simply because of their fame and stardom 
Overview 
 Goal: Search for topic-specific key 
influencers on microblogs 
 Model topic-specific influence in 
microblog networks 
 Learn topic-specific influence efficiently 
 Put it all together in a search framework 
Example (SKIT): query "healthy food" → Michelle Obama, Jimmy Oliver, …
Road Map 
 Followship-LDA (FLDA) Model 
 Scalable Gibbs Sampling Algorithm for FLDA 
 A General Search Framework for Topic-Specific Influencers 
 Experimental Results 
Intuition 
 Each microblog user has a bag of words (from tweets) and a set of 
followees 
Example (Alice) — Content: web, organic, veggie, cookie, cloud, …; Followees: Michelle Obama, Mark Zuckerberg, Barack Obama
 A user tweets in multiple topics 
• Alice tweets about technology and food 
 A topic is a mixture of words 
• “web” and “cloud” are more likely to appear in the technology topic 
 A user follows another for different reasons 
• Content-based: follow for topics 
• Content-independent: follow for popularity 
 A topic has a mixture of followees 
• Mark Zuckerberg is more likely to be followed for the technology topic 
• Topic-specific influence: the probability of a user u being followed for a topic t
Followship-LDA (FLDA) 
 FLDA: A Bayesian generative model that extends Latent 
Dirichlet Allocation (LDA) 
• Specify a probabilistic procedure by which the content and links of each 
user are generated in a microblog network 
• Latent variables (hidden structure) are introduced to explain the 
observed data 
 Topics, the reasons a user follows another, the topic-specific influence, etc.
• Inference: “reverse” the generative process 
 What hidden structure is most likely to have generated the observed data?
Hidden Structure in FLDA

 Per-user topic distribution:
           Tech  Food  Poli
 Alice     0.8   0.19  0.01
 Bob       0.6   0.3   0.1
 Mark      …     …     …
 Michelle  …     …     …
 Barack    …     …     …

 Per-topic word distribution:
       web    cookie  veggie  organic  congress  …
 Tech  0.3    0.1     0.001   0.001    0.001     …
 Food  0.001  0.15    0.3     0.1      0.001     …
 Poli  0.005  0.001   0.001   0.002    0.25      …

 Per-topic followee distribution (topic-specific influence):
       Alice  Bob   Mark  Michelle  Barack
 Tech  0.1    0.05  0.7   0.05      0.1
 Food  0.04   0.1   0.06  0.75      0.05
 Poli  0.03   0.02  0.05  0.1       0.8

 Per-user followship preference (Y = content-based, N = content-independent):
           Y     N
 Alice     0.75  0.25
 Bob       0.5   0.5
 Mark      …     …
 Michelle  …     …
 Barack    …     …

 Global followee distribution (global popularity):
         Alice  Bob    Mark   Michelle  Barack
 Global  0.005  0.001  0.244  0.25      0.5
Generative Process of FLDA 
For the mth user: 
Pick a per-user topic distribution θm 
Pick a per-user followship preference μm 
For the nth word 
• Pick a topic based on topic distribution θm 
• Pick a word based on per-topic word distribution φt 
For the lth followee 
• Choose the cause based on followship preference μm 
• If content-related 
 Pick a topic based on topic distribution θm 
 Pick a followee based on per-topic followee distribution σt 
• Otherwise 
 Pick a followee based on global followee distribution π 
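As a concrete sketch, the generative process above can be simulated directly. The sizes and symmetric Dirichlet hyperparameters below (T, V, U, alpha, beta, eta, gamma) are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and hyperparameters (not from the slides).
T, V, U = 3, 50, 10                          # topics, vocabulary size, users
alpha, beta, eta, gamma = 0.1, 0.01, 0.01, 0.5

phi = rng.dirichlet([beta] * V, size=T)      # per-topic word distribution
sigma = rng.dirichlet([eta] * U, size=T)     # per-topic followee distribution
pi = rng.dirichlet([eta] * U)                # global followee distribution

def generate_user(n_words=20, n_links=5):
    theta = rng.dirichlet([alpha] * T)       # per-user topic distribution
    mu = rng.dirichlet([gamma] * 2)          # followship preference (content vs. not)
    words, links = [], []
    for _ in range(n_words):
        z = rng.choice(T, p=theta)           # pick a topic for this word
        words.append(rng.choice(V, p=phi[z]))
    for _ in range(n_links):
        if rng.random() < mu[0]:             # content-based follow
            x = rng.choice(T, p=theta)       # pick a topic for this link
            links.append(rng.choice(U, p=sigma[x]))
        else:                                # content-independent follow
            links.append(rng.choice(U, p=pi))
    return words, links

words, links = generate_user()
```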
[Plate notation figure: per-user plates over words and links.]
Example (Alice): per-user topic distribution — Tech: 0.8, Food: 0.19, Poli: 0.01; followship preference — follow for content: 0.75, not: 0.25.
Content: web, organic, …
Followees: Michelle Obama, Barack Obama, …
Road Map 
 Followship-LDA (FLDA) Model 
 Scalable Gibbs Sampling Algorithm for FLDA 
 A General Search Framework for Topic-Specific Influencers 
 Experimental Results 
Inference of FLDA Model 
 Gibbs Sampling: A Markov chain Monte Carlo algorithm to approximate 
the distribution of latent variables based on the observed data. 
 Gibbs Sampling Process 
• Begin with some initial value for each variable 
• Iteratively sample each variable conditioned on the current values of the other 
variables, then update the variable with its new value (100s of iterations) 
• The samples are used to approximate distributions of latent variables 
 Derived conditional distributions for FLDA Gibbs Sampling 
p(z_{m,n} = t | z_{-(m,n)}, …): probability that the topic of the nth word of the mth user is t, given the current values of all other variables
p(y_{m,l} = r | y_{-(m,l)}, …): probability that the preference of the lth link of the mth user is r, given the current values of all other variables
p(x_{m,l} = t | x_{-(m,l)}, …, y_{m,l} = 1): probability that the topic of the lth (content-based) link of the mth user is t, given the current values of all other variables
Gibbs Sampling for FLDA 
 In each pass of data, for the mth user 
• For the nth word (observed value w), 
 Sample a new topic assignment t', based on p(z_{m,n} | z_{-(m,n)})
• For the lth followee (observed value e),
 Sample a new preference r', based on p(y_{m,l} | y_{-(m,l)})
 If r' = 1 (content-based), sample a new topic t', based on p(x_{m,l} | x_{-(m,l)}, y_{m,l} = 1)
• Keep counters while sampling 
 c^m_{w,t}: # times word w is assigned to topic t for the mth user
 d^m_{e,t}: # times followee e is assigned to topic t for the mth user
 Estimate posterior distributions for the latent variables
• per-user topic distribution
• per-topic followee distribution (topic-specific influence)
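For intuition, the word-topic resampling step with its counters can be sketched as a standard collapsed Gibbs sampler. This shows only the content part on toy data; the link sampling is analogous, and the conditional used here is the usual LDA form, not the exact FLDA conditional:

```python
import numpy as np

rng = np.random.default_rng(1)

T, V = 3, 20                 # topics, vocabulary size (toy values)
alpha, beta = 0.1, 0.01      # symmetric Dirichlet hyperparameters (assumed)

# Toy corpus: one word list per user (hypothetical data).
docs = [rng.integers(0, V, size=30).tolist() for _ in range(5)]

# Counters: n_mt[m,t] = words of user m assigned to topic t;
#           n_tw[t,w] = times word w is assigned to topic t; n_t = totals.
z = [[int(rng.integers(0, T)) for _ in doc] for doc in docs]
n_mt = np.zeros((len(docs), T)); n_tw = np.zeros((T, V)); n_t = np.zeros(T)
for m, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[m][i]; n_mt[m, t] += 1; n_tw[t, w] += 1; n_t[t] += 1

for _ in range(50):                          # sampling passes
    for m, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[m][i]                      # remove current assignment
            n_mt[m, t] -= 1; n_tw[t, w] -= 1; n_t[t] -= 1
            # p(z = t | rest) ∝ (n_mt + alpha) * (n_tw + beta) / (n_t + V*beta)
            p = (n_mt[m] + alpha) * (n_tw[:, w] + beta) / (n_t + V * beta)
            t = rng.choice(T, p=p / p.sum())
            z[m][i] = t; n_mt[m, t] += 1; n_tw[t, w] += 1; n_t[t] += 1

# Posterior estimates from the final counts.
theta = (n_mt + alpha) / (n_mt.sum(axis=1, keepdims=True) + T * alpha)  # per-user topics
phi = (n_tw + beta) / (n_tw.sum(axis=1, keepdims=True) + V * beta)      # per-topic words
```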
Distributed Gibbs Sampling for FLDA 
 Challenge: Gibbs Sampling process is sequential 
• Each sample step relies on the most recent values of all other variables. 
• Sequential algorithm would run for 21 days, on a high end server (192 GB RAM, 
3.2GHz processor) for a Twitter data set with 1.8M users, 2.4B words and 183M 
links. 
 Observation: dependency between variable assignments is relatively 
weak, given the abundance of words and links 
 Solution: Distributed Gibbs Sampling for FLDA 
• Relax the sequential requirement of Gibbs Sampling 
• Implemented on Spark – a distributed processing framework for iterative 
machine learning workloads 
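A minimal sketch of the relaxation: each worker samples its user partition against a stale snapshot of the global counters and ships back a delta, which the master merges at a synchronization barrier (an AD-LDA-style scheme). The worker loop is simulated sequentially here, and `fake_sample` is a stand-in for the real per-user sampling:

```python
import numpy as np

def local_pass(global_ntw, partition, sample_fn):
    local = global_ntw.copy()          # stale snapshot of the global counters
    for user in partition:
        sample_fn(user, local)         # updates only the local copy
    return local - global_ntw          # delta to send back to the master

def synchronize(global_ntw, deltas):
    for d in deltas:
        global_ntw += d                # master merges all worker deltas
    return global_ntw

T, V = 2, 4
global_ntw = np.zeros((T, V))
partitions = [[0, 1], [2, 3]]          # users split across two "workers"

def fake_sample(user, counts):
    counts[user % T, user % V] += 1    # stand-in for real per-user sampling

deltas = [local_pass(global_ntw, p, fake_sample) for p in partitions]
global_ntw = synchronize(global_ntw, deltas)
```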
Distributed Gibbs Sampling for FLDA

[Architecture figure: a master and multiple workers. The master maintains the global counters (c*_{e,t}, …); each worker holds a partition of users with its local counters (c^m_{w,t}, d^m_{e,t}, …) and the per-user sampling state: for each word its topic assignment, and for each link its preference (Y/N) and, if content-based, its topic. Workers periodically synchronize their local counters with the master's global counters.]
Issues with Distributed Gibbs Sampling 
 Check-pointing for fault tolerance 
• Spark uses lineage for fault tolerance; long iterative computations also need periodic check-points
 Frequency of global synchronization 
• Synchronizing globally only every 10 iterations does not affect result quality
 Random number generation in a distributed environment
• Provably independent multiple streams of uniform random numbers
• A long-period, jump-ahead F2-linear random number generator
 Efficient local computation
• Be aware of the local memory hierarchy
• Avoid random memory access by sampling in order
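One way to realize such jump-ahead streams, sketched with NumPy's MT19937 (an F2-linear generator whose `jumped()` method advances the state by 2^128 steps). The slides do not name a specific library, so this is illustrative:

```python
import numpy as np

def worker_streams(seed, n_workers):
    """One independent Generator per worker via jump-ahead on an F2-linear RNG.

    MT19937.jumped() returns a new bit generator advanced by 2^128 steps,
    so the workers' subsequences cannot overlap in practice.
    """
    bitgen = np.random.MT19937(seed)
    streams = []
    for _ in range(n_workers):
        streams.append(np.random.Generator(bitgen))
        bitgen = bitgen.jumped()       # next worker starts 2^128 draws ahead
    return streams

streams = worker_streams(42, 4)
draws = [s.random() for s in streams]  # each worker draws from its own stream
```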
Road Map 
 Followship-LDA (FLDA) Model 
 Scalable Gibbs Sampling Algorithm for FLDA 
 A General Search Framework for Topic-Specific Influencers 
 Experimental Results 
Search Framework For Topic-Specific Influencers 
 SKIT: a search framework for topic-specific key influencers
• Input: key words 
• Output: ranked list of key influencers 
• Can plug in various key influencer methods 
 FLDA, Link-LDA, Topic-Sensitive PageRank (TSPR), TwitterRank 
 Step 1: Derive the topics of interest from the key words
• Treat the key words as the content of a new user
• "Fold in" this new user to the existing model
• Result: a topic distribution of the input <p1, p2, …, pt, …>
 Step 2: Compute an influence score for each user
• Per-topic influence score of each user from FLDA: inf_u^t
• Combined score: inf_u = Σ_t p_t × inf_u^t
 Step 3: Sort users by influence score and return the top influencers

Example: query "healthy food" → Michelle Obama, Jimmy Oliver, …
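Step 2's score combination is a dot product between the query's topic distribution and the per-topic influence table. The numbers below are illustrative (a three-user subset in the spirit of the earlier example), not model output:

```python
import numpy as np

# Hypothetical model output: topic distribution of the query (from folding in
# the key words) and influence[t, u] = p(user u is followed for topic t).
p_topic = np.array([0.1, 0.85, 0.05])            # query is mostly the "food" topic
influence = np.array([[0.10, 0.05, 0.70],        # tech
                      [0.04, 0.75, 0.06],        # food
                      [0.03, 0.10, 0.80]])       # politics
users = ["Alice", "Michelle", "Mark"]

scores = p_topic @ influence                     # inf_u = sum_t p_t * inf_u^t
ranking = [users[i] for i in np.argsort(scores)[::-1]]
```

With these numbers the "food"-heavy query ranks Michelle first, as expected.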
Road Map 
 Followship-LDA (FLDA) Model 
 Scalable Gibbs Sampling Algorithm for FLDA 
 A General Search Framework for Topic-Specific Influencers 
 Experimental Results 
Effectiveness on Twitter Dataset 
 Twitter Dataset crawled in 2010 
• 1.76M users, 2.36B words, 183M links, 159K distinct words 
Topic: "Information Technology"
• Top-10 key words: data, web, cloud, software, open, windows, Microsoft, server, security, code
• Top-5 influencers: Tim O'Reilly, Gartner Inc, Scott Hanselman (software blogger), Jeff Atwood (software blogger, co-founder of stackoverflow.com), Elijah Manor (software blogger)

Topic: "Cycling and Running"
• Top-10 key words: bike, ride, race, training, running, miles, team, workout, marathon, fitness
• Top-5 influencers: Lance Armstrong, Levi Leipheimer, George Hincapie (all three US Postal pro cycling team members), Johan Bruyneel (US Postal team director), RSLT (a pro cycling team)

Topic: "Jobs"
• Top-10 key words: business, job, jobs, management, manager, sales, services, company, service, small, hiring
• Top-5 influencers: Job-hunt.org, jobsguy.com, integritystaffing.com/blog (job search Ninja), jobConcierge.com, careerealism.com

 Interesting findings
• Overall, ~15% of followships are content-independent
• Top-5 globally popular users: Pete Wentz (singer), Ashton Kutcher (actor), Greg Grunberg (actor), Britney Spears (singer), Ellen DeGeneres (comedian)
Effectiveness on Tencent Weibo Dataset 
 Tencent Weibo dataset from KDD Cup 2012
• 2.33M users, 492M words, 51M links, 714K distinct words 
• VIP users – used as “ground truth” 
 Organized in categories 
 Manually labeled “key influencers” in the corresponding categories 
• Quality evaluation 
 Search Input: content of the VIP user 
 Precision: how many of the top-k results are the other VIP users of the same category 
 Measure: Mean Average Precision (MAP) for all the VIP users across all categories 
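MAP over the VIP queries is the mean of per-query average precision. A minimal sketch, with toy ranked lists and relevant sets:

```python
def average_precision(ranked, relevant):
    """AP of one ranked result list against a set of relevant items."""
    hits, score = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / i          # precision at each relevant rank
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over (ranked_list, relevant_set) pairs, one per VIP query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [(["u1", "u3", "u2"], {"u1", "u2"}),   # AP = (1/1 + 2/3) / 2 = 5/6
        (["u4", "u5"], {"u5"})]               # AP = 1/2
```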
 FLDA is significantly better than 
previous approaches 
• 2x better than TSPR and TwitterRank 
• 1.6x better than Link-LDA 
 Distributed FLDA achieves similar 
quality results as sequential FLDA 
Efficiency of Distributed FLDA 
 Setup 
• Sequential: a high end server (192 GB RAM, 3.2GHz) 
• Distributed: 27 servers (32GB RAM, 2.5GHz, 8 cores) connected by 
1Gbit Ethernet, 1 master and 200 workers 
 Dataset 
• Tencent Weibo: 2.33M users, 492M words, 51M links 
• Twitter: 1.76M users, 2.36B words, 183M links 
 Sequential vs. Distributed (100 topics, 500 iterations)
           Sequential  Distributed
  Weibo    4.6 days    8 hours
  Twitter  21 days     1.5 days
Scalability of Distributed FLDA 
 Scale-up experiment on the Twitter dataset
[Figure: wall-clock time per iteration (sec) vs. number of concurrent workers (25, 50, 100, 200), for corpus sizes of 12.5%, 25%, 50%, and 100%, and for 25, 50, 100, and 200 topics.]
Conclusion 
 FLDA model for topic-specific influence analysis 
• Content + Links 
 Distributed Gibbs Sampling algorithm for FLDA 
• Results comparable to the sequential Gibbs Sampling algorithm
• Scales nicely with # workers and data sizes 
 General search framework for topic-specific key influencers 
• Can plug in various key influencer methods 
 Significantly higher quality results than previous work 
• 2x better than PageRank based methods 
• 1.6x better than Link-LDA 
A flexible recommenndation system for Cable TVA flexible recommenndation system for Cable TV
A flexible recommenndation system for Cable TV
 
A Flexible Recommendation System for Cable TV
A Flexible Recommendation System for Cable TVA Flexible Recommendation System for Cable TV
A Flexible Recommendation System for Cable TV
 
H2O World - Quora: Machine Learning Algorithms to Grow the World's Knowledge ...
H2O World - Quora: Machine Learning Algorithms to Grow the World's Knowledge ...H2O World - Quora: Machine Learning Algorithms to Grow the World's Knowledge ...
H2O World - Quora: Machine Learning Algorithms to Grow the World's Knowledge ...
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Recsys 2016
Recsys 2016Recsys 2016
Recsys 2016
 
Relationships Matter: Using Connected Data for Better Machine Learning
Relationships Matter: Using Connected Data for Better Machine LearningRelationships Matter: Using Connected Data for Better Machine Learning
Relationships Matter: Using Connected Data for Better Machine Learning
 
Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix Scale
Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix ScaleQcon SF 2013 - Machine Learning & Recommender Systems @ Netflix Scale
Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix Scale
 
[CS570] Machine Learning Team Project (I know what items really are)
[CS570] Machine Learning Team Project (I know what items really are)[CS570] Machine Learning Team Project (I know what items really are)
[CS570] Machine Learning Team Project (I know what items really are)
 
Cikm 2013 - Beyond Data From User Information to Business Value
Cikm 2013 - Beyond Data From User Information to Business ValueCikm 2013 - Beyond Data From User Information to Business Value
Cikm 2013 - Beyond Data From User Information to Business Value
 
Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen PresentationNeo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation
 
DIE 20130724
DIE 20130724DIE 20130724
DIE 20130724
 
Immersive Recommendation
Immersive RecommendationImmersive Recommendation
Immersive Recommendation
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
2017 10-10 (netflix ml platform meetup) learning item and user representation...
2017 10-10 (netflix ml platform meetup) learning item and user representation...2017 10-10 (netflix ml platform meetup) learning item and user representation...
2017 10-10 (netflix ml platform meetup) learning item and user representation...
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Summary of a Recommender Systems Survey paper
Summary of a Recommender Systems Survey paperSummary of a Recommender Systems Survey paper
Summary of a Recommender Systems Survey paper
 

Dernier

毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 

Dernier (20)

毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 

Scalable Topic-Specific Influence Analysis on Microblogs

  • 1. Scalable Topic-Specific Influence Analysis on Microblogs. Yuanyuan Tian, IBM Almaden Research Center. © 2013 IBM Corporation
  • 2. Motivation  Huge amount of textual and social information produced by popular microblogging sites. • Twitter had over 500 million users creating over 340 million tweets daily, reported in March 2012.  A popular resource for marketing campaigns • Monitor opinions of consumers • Launch viral advertising  Identifying key influencers is crucial for market analysis © 2013 IBM Corporation
  • 3. Existing Social Influence Analysis  Most existing influence analyses are based only on network structures • E.g. the influence maximization work by Kleinberg et al. • Valuable textual content is ignored • Only global influencers are identified  Topic-specific influence analysis • Network + Content • Differentiates influence in different aspects of life (topics) © 2013 IBM Corporation
  • 4. Existing Topic-Specific Influence Analysis  Separate analysis of content and analysis of networks • E.g. Topic-sensitive PageRank (TSPR), TwitterRank • Text analysis → topics • For each topic, apply influence analysis on the network structure • But in microblog networks, content and links are often related  A user tends to follow another who tweets in similar topics  Analysis of content and network together • E.g. Link-LDA • But designed for citation and hyperlink networks • Assumption: content is the only cause of links • But in microblog networks, a user sometimes follows another for a content-independent reason  A user follows celebrities simply because of their fame and stardom © 2013 IBM Corporation
  • 5. Overview  Goal: Search for topic-specific key influencers on microblogs  Model topic-specific influence in microblog networks  Learn topic-specific influence efficiently  Put it all together in a search framework (figure: SKIT search example, query “healthy food” → Michelle Obama, Jimmy Oliver, …) © 2013 IBM Corporation
  • 6. Road Map  Followship-LDA (FLDA) Model  Scalable Gibbs Sampling Algorithm for FLDA  A General Search Framework for Topic-Specific Influencers  Experimental Results © 2013 IBM Corporation
  • 7. Intuition  Each microblog user has a bag of words (from tweets) and a set of followees (e.g. Alice; content: web, organic, veggie, cookie, cloud, …; followees: Michelle Obama, Mark Zuckerberg, Barack Obama)  A user tweets in multiple topics • Alice tweets about technology and food  A topic is a mixture of words • “web” and “cloud” are more likely to appear in the technology topic  A user follows another for different reasons • Content-based: follow for topics • Content-independent: follow for popularity  A topic has a mixture of followees • Mark Zuckerberg is more likely to be followed for the technology topic • Topic-specific influence: the probability of a user u being followed for a topic t © 2013 IBM Corporation
  • 8. Followship-LDA (FLDA)  FLDA: A Bayesian generative model that extends Latent Dirichlet Allocation (LDA) • Specifies a probabilistic procedure by which the content and links of each user are generated in a microblog network • Latent variables (hidden structure) are introduced to explain the observed data  Topics, reasons of a user following another, the topic-specific influence, etc. • Inference: “reverse” the generative process  What hidden structure is most likely to have generated the observed data? © 2013 IBM Corporation
  • 9. Hidden Structure in FLDA
  • Per-user topic distribution (user × topic), e.g. Alice: Tech 0.8, Food 0.19, Poli 0.01; Bob: Tech 0.6, Food 0.3, Poli 0.1
  • Per-topic word distribution (topic × word), e.g. Tech: web 0.3, cookie 0.1; Food: cookie 0.15, veggie 0.3, organic 0.1; Poli: congress 0.25
  • Per-user followship preference (follow for content: Y/N), e.g. Alice: Y 0.75, N 0.25; Bob: Y 0.5, N 0.5
  • Per-topic followee distribution = topic-specific influence (topic × user), e.g. Tech: Mark 0.7; Food: Michelle 0.75; Poli: Barack 0.8
  • Global followee distribution = global popularity, e.g. Mark 0.244, Michelle 0.25, Barack 0.5
  © 2013 IBM Corporation
  • 10. Generative Process of FLDA For the mth user: Pick a per-user topic distribution θm Pick a per-user followship preference μm For the nth word • Pick a topic based on topic distribution θm • Pick a word based on per-topic word distribution φt For the lth followee • Choose the cause based on followship preference μm • If content-related  Pick a topic based on topic distribution θm  Pick a followee based on per-topic followee distribution σt • Otherwise  Pick a followee based on global followee distribution π (figure: plate notation; example for Alice: topics Tech 0.8, Food 0.19, Poli 0.01; follow for content 0.75, not 0.25; content: web, organic, …; followees: Michelle Obama, Barack Obama, …) © 2013 IBM Corporation
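Read as pseudocode, the generative steps on the slide above can be sketched as a tiny Python simulation. All sizes, the symmetric Dirichlet hyperparameters alpha/beta, and the Beta prior on the preference μ are illustrative assumptions, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: T topics, V vocabulary words, U candidate followees
T, V, U = 3, 8, 5
alpha, beta = 0.1, 0.01  # assumed symmetric Dirichlet hyperparameters

phi = rng.dirichlet([beta] * V, size=T)    # per-topic word distribution (phi_t)
sigma = rng.dirichlet([beta] * U, size=T)  # per-topic followee distribution (sigma_t)
pi = rng.dirichlet([beta] * U)             # global followee distribution (pi)

def generate_user(n_words, n_links):
    theta = rng.dirichlet([alpha] * T)  # per-user topic distribution (theta_m)
    mu = rng.beta(1.0, 1.0)             # per-user preference: P(follow for content)
    # Content: for each word slot, pick a topic, then a word from that topic
    words = [rng.choice(V, p=phi[rng.choice(T, p=theta)]) for _ in range(n_words)]
    links = []
    for _ in range(n_links):
        if rng.random() < mu:   # content-based: topic first, then followee by influence
            links.append(rng.choice(U, p=sigma[rng.choice(T, p=theta)]))
        else:                   # content-independent: followee by global popularity
            links.append(rng.choice(U, p=pi))
    return words, links

words, links = generate_user(20, 5)
```

Inference then runs this process in reverse: given observed words and links, recover the θ, μ, φ, σ, π most likely to have generated them.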
  • 11. Road Map  Followship-LDA (FLDA) Model  Scalable Gibbs Sampling Algorithm for FLDA  A General Search Framework for Topic-Specific Influencers  Experimental Results © 2013 IBM Corporation
  • 12. Inference of FLDA Model  Gibbs Sampling: A Markov chain Monte Carlo algorithm to approximate the distribution of latent variables based on the observed data.  Gibbs Sampling Process • Begin with some initial value for each variable • Iteratively sample each variable conditioned on the current values of the other variables, then update the variable with its new value (100s of iterations) • The samples are used to approximate distributions of latent variables  Derived conditional distributions for FLDA Gibbs Sampling: p(z_{m,n}=t | -z_{m,n}): prob. the topic of the nth word of the mth user is t, given the current values of all other variables; p(y_{m,l}=r | -y_{m,l}): prob. the preference of the lth link of the mth user is r, given the current values of all other variables; p(x_{m,l}=t | -x_{m,l}, y_{m,l}=1): prob. the topic of the lth link of the mth user is t, given the current values of all other variables © 2013 IBM Corporation
  • 13. Gibbs Sampling for FLDA  In each pass of the data, for the mth user • For the nth word (observed value w),  Sample a new topic assignment t’, based on p(z_{m,n} | -z_{m,n}) • For the lth followee (observed value e),  Sample a new preference r’, based on p(y_{m,l} | -y_{m,l})  If r’=1 (content based), sample a new topic t’, based on p(x_{m,l} | -x_{m,l}, y_{m,l}=1) • Keep counters while sampling  c^m_{w,t}: # times w is assigned to t for the mth user  d^m_{e,t}: # times e is assigned to t for the mth user  Estimate posterior distributions for latent variables: per-user topic distribution, per-topic followee distribution (influence) © 2013 IBM Corporation
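The counter bookkeeping above is easiest to see in code. The sketch below implements only the word-topic step, as a standard collapsed-Gibbs LDA update (remove the token's counts, sample from the conditional, add the counts back); FLDA's additional link/preference steps and its exact conditionals are omitted, and the toy corpus and hyperparameters are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

docs = [[0, 1, 1, 2], [2, 3, 3, 0]]  # toy corpus: word ids per user (assumed)
T, V = 2, 4
alpha, beta = 0.5, 0.1

z = [[int(rng.integers(T)) for _ in doc] for doc in docs]  # initial topic assignments
ndt = np.zeros((len(docs), T))  # per-user topic counts
nwt = np.zeros((V, T))          # word-topic counts (the c_{w,t} counters)
nt = np.zeros(T)                # tokens per topic
for m, doc in enumerate(docs):
    for n, w in enumerate(doc):
        t = z[m][n]
        ndt[m, t] += 1; nwt[w, t] += 1; nt[t] += 1

for _ in range(50):  # Gibbs sweeps
    for m, doc in enumerate(docs):
        for n, w in enumerate(doc):
            t = z[m][n]
            ndt[m, t] -= 1; nwt[w, t] -= 1; nt[t] -= 1  # exclude current token
            p = (ndt[m] + alpha) * (nwt[w] + beta) / (nt + V * beta)
            t = int(rng.choice(T, p=p / p.sum()))       # sample new topic
            z[m][n] = t
            ndt[m, t] += 1; nwt[w, t] += 1; nt[t] += 1

# Posterior estimate of each user's topic distribution from the counters
theta = (ndt + alpha) / (ndt + alpha).sum(axis=1, keepdims=True)
```

The same pattern extends to FLDA's link counters d_{e,t}: normalized column-wise they give the per-topic followee distribution, i.e. the topic-specific influence scores.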
  • 14. Distributed Gibbs Sampling for FLDA  Challenge: the Gibbs Sampling process is sequential • Each sampling step relies on the most recent values of all other variables. • The sequential algorithm would run for 21 days on a high-end server (192 GB RAM, 3.2GHz processor) for a Twitter data set with 1.8M users, 2.4B words and 183M links.  Observation: dependency between variable assignments is relatively weak, given the abundance of words and links  Solution: Distributed Gibbs Sampling for FLDA • Relax the sequential requirement of Gibbs Sampling • Implemented on Spark, a distributed processing framework for iterative machine learning workloads © 2013 IBM Corporation
  • 15. Distributed Gibbs Sampling for FLDA (figure: a master holds the global counters, e.g. c*_{e,t}; each worker holds a partition of users with their words, links, topic and preference assignments, and local counters c^m_{w,t}, d^m_{e,t}; workers sample locally and periodically synchronize with the master) © 2013 IBM Corporation
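A toy single-process simulation of the master/worker counter flow sketched above; the per-pass work is a stand-in for real Gibbs sampling, and all sizes are assumptions. Each worker samples against a possibly stale snapshot of the global counters and ships back only its local delta, which the master merges:

```python
import numpy as np

T, U = 3, 6
global_counts = np.zeros((U, T))  # master's global followee-topic counters

def worker_pass(snapshot, rng):
    """Stand-in for one local sampling pass on a partition of users.

    Reads a (possibly stale) snapshot of the global counters,
    accumulates local counter changes, and returns only the delta."""
    delta = np.zeros_like(snapshot)
    for _ in range(100):  # pretend to assign 100 (followee, topic) pairs
        e, t = rng.integers(U), rng.integers(T)
        delta[e, t] += 1
    return delta

rngs = [np.random.default_rng(s) for s in range(4)]  # 4 simulated workers
for _ in range(10):  # one global synchronization per round
    snapshot = global_counts.copy()
    deltas = [worker_pass(snapshot, r) for r in rngs]  # runs in parallel on Spark
    global_counts += sum(deltas)                       # master merges worker deltas
```

Merging deltas rather than absolute counts is what lets workers proceed between synchronizations without overwriting each other's updates.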
  • 16. Issues with Distributed Gibbs Sampling  Check-pointing for fault tolerance • Spark uses lineage for fault tolerance  Frequency of global synchronization • Even synchronizing every 10 iterations won’t affect the quality of the result  Random number generation in a distributed environment • Provably independent multiple streams of uniform numbers • A long-period jump-ahead F2-linear random number generator  Efficient local computation • Be aware of the local memory hierarchy • Avoid random memory access by sampling in order © 2013 IBM Corporation
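The RNG concern above is that naive per-worker seeding can produce overlapping streams; the talk uses a jump-ahead F2-linear generator. NumPy's SeedSequence spawning is a different, off-the-shelf mechanism with the same goal (statistically independent per-worker streams), shown here only as an illustration, not as the implementation used in the paper:

```python
from numpy.random import SeedSequence, default_rng

# One root seed for the whole job; spawn() derives independent child
# seed sequences, one per worker, so parallel samplers never share a stream.
root = SeedSequence(12345)
worker_rngs = [default_rng(s) for s in root.spawn(200)]

draws = [rng.random() for rng in worker_rngs]
```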
  • 17. Road Map  Followship-LDA (FLDA) Model  Scalable Gibbs Sampling Algorithm for FLDA  A General Search Framework for Topic-Specific Influencers  Experimental Results © 2013 IBM Corporation
  • 18. Search Framework For Topic-Specific Influencers  SKIT: search framework for topic-specific key influencers • Input: key words • Output: ranked list of key influencers • Can plug in various key influencer methods  FLDA, Link-LDA, Topic-Sensitive PageRank (TSPR), TwitterRank  Step 1: Derivation of interested topics from the key words • Treat the key words as the content of a new user • “Fold in” to the existing model • Result: a topic distribution of the input <p1, p2, …, pt, …>  Step 2: Compute an influence score for each user • Per-topic influence score for each user from FLDA: inf_u^t • Combine the scores: inf_u = Σ_t p_t × inf_u^t  Step 3: Sort users by influence scores and return the top influencers (example: query “healthy food” → Michelle Obama, Jimmy Oliver, …) © 2013 IBM Corporation
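Step 2's combination inf_u = Σ_t p_t × inf_u^t is just a matrix-vector product. A minimal sketch, reusing the illustrative per-topic influence numbers from the hidden-structure slide (the query's topic distribution p is made up):

```python
import numpy as np

users = ["Mark Zuckerberg", "Michelle Obama", "Barack Obama"]
# inf[u, t]: per-topic influence of user u for topics (Tech, Food, Poli),
# taken from the illustrative table on the hidden-structure slide
inf = np.array([[0.70, 0.06, 0.05],
                [0.05, 0.75, 0.10],
                [0.10, 0.05, 0.80]])
p = np.array([0.7, 0.25, 0.05])  # query topic distribution from step 1 (assumed)

scores = inf @ p                               # inf_u = sum_t p_t * inf_u^t
top = [users[i] for i in np.argsort(-scores)]  # step 3: sort, return top influencers
```

For a mostly-Tech query like this one, Mark Zuckerberg's high Tech influence dominates the combined score.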
  • 19. Road Map  Followship-LDA (FLDA) Model  Scalable Gibbs Sampling Algorithm for FLDA  A General Search Framework for Topic-Specific Influencers  Experimental Results © 2013 IBM Corporation
  • 20. Effectiveness on Twitter Dataset  Twitter dataset crawled in 2010 • 1.76M users, 2.36B words, 183M links, 159K distinct words  Topics found (top-10 key words; top-5 influencers):
  • “Information Technology”: data, web, cloud, software, open, windows, Microsoft, server, security, code; influencers: Tim O’Reilly, Gartner Inc, Scott Hanselman (software blogger), Jeff Atwood (software blogger, co-founder of stackoverflow.com), Elijah Manor (software blogger)
  • “Cycling and Running”: bike, ride, race, training, running, miles, team, workout, marathon, fitness; influencers: Lance Armstrong, Levi Leipheimer, George Hincapie (all 3 US Postal pro cycling team members), Johan Bruyneel (US Postal team director), RSLT (a pro cycling team)
  • “Jobs”: business, job, jobs, management, manager, sales, services, company, service, small, hiring; influencers: Job-hunt.org, jobsguy.com, integritystaffing.com/blog (job search Ninja), jobConcierge.com, careerealism.com
   Interesting findings • Overall ~15% of followships are content-independent • Top-5 globally popular users: Pete Wentz (singer), Ashton Kutcher (actor), Greg Grunberg (actor), Britney Spears (singer), Ellen Degeneres (comedian) © 2013 IBM Corporation
  • 21. Effectiveness on Tencent Weibo Dataset  Tencent Weibo dataset from KDD Cup 2012 • 2.33M users, 492M words, 51M links, 714K distinct words • VIP users, used as “ground truth”  Organized in categories  Manually labeled “key influencers” in the corresponding categories • Quality evaluation  Search input: content of the VIP user  Precision: how many of the top-k results are the other VIP users of the same category  Measure: Mean Average Precision (MAP) for all the VIP users across all categories  FLDA is significantly better than previous approaches • 2x better than TSPR and TwitterRank • 1.6x better than Link-LDA  Distributed FLDA achieves similar quality results as sequential FLDA © 2013 IBM Corporation
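For reference, the Mean Average Precision used in this evaluation can be sketched as follows (a standard definition; the toy ranked lists are made up, not the Weibo data):

```python
def average_precision(ranked, relevant, k=10):
    """AP@k: average of the precision values at each rank where a
    relevant item (here, a same-category VIP user) appears."""
    hits, precisions = 0, []
    for i, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / min(len(relevant), k) if relevant else 0.0

def mean_average_precision(queries):
    """queries: iterable of (ranked_results, relevant_set) pairs,
    one per VIP user used as a search input."""
    queries = list(queries)
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

ap = average_precision(["u1", "x", "u2"], {"u1", "u2"})  # hits at ranks 1 and 3
```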
  • 22. Efficiency of Distributed FLDA  Setup • Sequential: a high-end server (192 GB RAM, 3.2GHz) • Distributed: 27 servers (32GB RAM, 2.5GHz, 8 cores) connected by 1Gbit Ethernet, 1 master and 200 workers  Dataset • Tencent Weibo: 2.33M users, 492M words, 51M links • Twitter: 1.76M users, 2.36B words, 183M links  Sequential vs. Distributed (100 topics, 500 iterations) • Weibo: sequential 4.6 days, distributed 8 hours • Twitter: sequential 21 days, distributed 1.5 days © 2013 IBM Corporation
  • 23. Scalability of Distributed FLDA  Scale-up experiment on the Twitter dataset (figure: wall-clock time per iteration (sec) vs. number of concurrent workers (25, 50, 100, 200), for corpus sizes 12.5%, 25%, 50%, 100% and 25, 50, 100, 200 topics) © 2013 IBM Corporation
  • 24. Conclusion  FLDA model for topic-specific influence analysis • Content + Links  Distributed Gibbs Sampling algorithm for FLDA • Comparable results as the sequential Gibbs Sampling algorithm • Scales nicely with # workers and data sizes  General search framework for topic-specific key influencers • Can plug in various key influencer methods  Significantly higher quality results than previous work • 2x better than PageRank based methods • 1.6x better than Link-LDA © 2013 IBM Corporation

Editor's notes

  1. I am Yuanyuan Tian from IBM Research. Today, in this talk, I will focus on one recent project on scalable topic-specific influence analysis. I chose to talk about this project because it touches on both the analytics and the systems aspects of research, so it is a good representative of the kind of work we are doing in the Information Management group at IBM Almaden.
  2. Microblogging services, such as Twitter (twitter.com), have gained tremendous popularity in recent years. A large amount of microblog data has been accumulated over time. According to a March 2012 report, Twitter had over 500 million users creating over 340 million tweets daily. Note that the data contains both textual content, reflected in the tweets, and social relationship information, reflected in the follower and followee relationships. The rich text and social information in microblogs has become a popular resource for marketing campaigns to monitor the opinions of consumers on particular products and to launch viral advertising. Identifying key influencers in microblogs is required for such marketing activities.
  3. Although a lot of work has been done on social influence analysis, most of these studies infer influence only from the network structure, while ignoring the valuable text content that the users created. One of the most well-known studies is the influence maximization work by Jon Kleinberg et al. As a result, the learned influence of each user is only global, with no way to assess the influence in a particular aspect of life (topic). For example, no one can deny that President Obama is a key influencer in general, but if you want to advertise a database product, he is unlikely to be influential, whereas a database expert, who wouldn't be identified as a general key influencer, probably has more say on this subject. So what we want is topic-specific influence analysis that can differentiate influence in different aspects of life, or different topics. In order to do that, we need to analyze not just the network structure but also the valuable textual content.
  4. A number of PageRank-based methods, such as Topic-Sensitive PageRank [13] and TwitterRank [25], are able to compute per-topic rank scores, but they require a separate process to create the topics from the text, and then, for each topic, apply the influence analysis using the network structure. As content and links are related to each other in a microblog network, the separation between the analysis of content and the analysis of the network structure usually leads to inferior performance. A few existing works can analyze content and network together, such as Link-LDA. However, they were all designed for citation and hyperlink networks, and assume that links are solely caused by content. This assumption clearly does not apply to microblogs, since it is prevalent for a user to follow celebrities simply because of their fame and stardom, with nothing to do with what he/she actually tweets about.
  5. The goal of this work is to support searching for topic-specific key influencers on microblogs. Ultimately, we want to provide a search framework where users can simply type in key words to express their interested topic, or combination of topics, and the search engine returns a ranked list of users who are influential in the corresponding subjects. In order to do that, we need to first correctly model topic-specific influence in microblog networks, and then learn the influence efficiently. These two are the focus of this talk. I will also briefly talk about how to put everything together in a search framework for topic-specific influencers.
  6. To meet the computational challenge posed by the rapidly growing microblog data, we propose a distributed Gibbs sampling algorithm for the FLDA model. We then incorporate our proposed method into a general search framework for topic-specific influencers. After that, I will present some experimental results. Now, let's first talk about the new FLDA model.
  7. Before I go into the details of the topic-specific model, let me first provide the intuition behind the model. In a microblog network, each user has both content, which is a bag of words, and link structure, which is a set of followees. A user tweets in multiple topics. Based on the content of Alice, it looks like she likes to tweet about technology and food. Then a topic can be viewed as a mixture of words, e.g. the words web and cloud are likely to appear in the technology topic. As for the relationships among the users in a microblog network, there are different reasons why a user follows another. Sometimes a followship is content-based, because the followee tweets in similar topics. Other times it is completely content-independent, because it is very prevalent for a user to follow a celebrity just because of fame and stardom, with nothing to do with what the follower tweets about. Finally, each topic should have a mixture of followees. In other words, given a topic, some users are more likely to be followed than others. For example, Mark Zuckerberg is more likely to be followed for the technology topic. We measure topic-specific influence by the probability of a user u being followed for a given topic. So if a user has a higher probability of being followed given a topic t, he/she is also more influential in that topic.
  8. To correctly model topic-specific influence on microblogs, we propose a new Bayesian model, called Followship-LDA (FLDA). It is a Bayesian generative model that extends Latent Dirichlet Allocation (LDA). We call it a generative model because it specifies a probabilistic procedure, based on the intuition just described, by which the content and links of each user in a microblog network are generated. To explain the creation of content and links, we introduce hidden structure, or latent variables, into the generative process, including the topics, the reasons for followships, and the topic-specific influence. Given the model and the observed data, the goal is then to reverse the generative process and find the hidden structure that is most likely to have generated the observed data.
  9. Now let's look at the hidden structures we introduce in the generative model. Each user tweets in different topics, so each user has a topic distribution indicating how likely he or she is to tweet in each topic. Suppose there are 3 latent topics: tech, food, and politics. For example, user Alice tweets about tech 80% of the time, food 19% of the time, and politics the remaining 1% of the time. Each topic is a mixture of words, so each topic has a per-topic word distribution, indicating how likely different words are to be used in that topic. For example, in the tech topic, the word "web" is used 30% of the time, "cookie" 10% of the time, and so on. As I mentioned before, there are different reasons why a user follows another, so each user has a followship preference. For example, Alice follows for content 75% of the time and for popularity the other 25% of the time. Since some followships are content-based, each topic has a followee distribution, indicating the probability of a user being followed for that topic. For example, for the tech topic, Mark Zuckerberg is followed 70% of the time. Each number in this table is the probability of a user being followed given a topic; this is exactly the topic-specific influence score we are after. If a user has a higher probability of being followed for a topic, then this user has higher influence in that topic. Finally, we also have a global followee distribution, indicating the probability of a user being followed for content-independent reasons. For example, if a followship is totally content-independent, then 50% of the time Obama is the followee. It measures the global popularity of each user.
  10. Now let's describe the generative process of FLDA. This figure shows the plate notation of the FLDA model. (The boxes represent replication: the outer box represents repetition over users, the inner right box the repeated generation of words, and the inner left box the repeated creation of links.) Don't worry if you don't know plate notation; you can still follow the generative process. The process repeats for each user. For the m-th user, say Alice, it first picks a per-user topic distribution from a Dirichlet prior. It also picks a per-user followship preference for this user. We first generate the content of this user: to generate each word, we choose a topic from the topic distribution, say tech, then pick a word to represent that topic from the per-topic word distribution; in our example, "web" is chosen. The process continues for the remaining words. After the content generation, we generate the followees. For each followee, we first choose the reason for the followship based on the per-user followship preference. If the followship is content-based, we pick a topic from the same topic distribution used in the content generation, then pick a followee who well addresses that topic from the per-topic followee distribution. For a different link, the generative process may decide that the followship is content-independent; in that case, a followee is chosen from the global popularity distribution. This is how content and links are created in the generative process of FLDA.
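The generative story above can be sketched in a few lines of code. This is a minimal simulation for a single user, not the paper's implementation; the parameter names (`alpha`, `phi`, `psi`, `chi`, `rho_prior`) are my own labels for the distributions just described, not necessarily the paper's notation.

```python
import numpy as np

def generate_user(rng, n_words, n_links, alpha, phi, psi, chi, rho_prior):
    """Sketch of FLDA's generative process for one user.

    Hypothetical parameter names (assumptions, not the paper's notation):
      alpha      -- Dirichlet prior over topics
      phi[t]     -- per-topic word distribution (T x V matrix)
      psi[t]     -- per-topic followee distribution (T x U matrix)
      chi        -- global, content-independent followee distribution (U)
      rho_prior  -- Beta prior for the content-vs-popularity preference
    """
    T = len(alpha)
    theta = rng.dirichlet(alpha)        # per-user topic distribution
    rho = rng.beta(*rho_prior)          # P(a followship is content-based)

    words = []
    for _ in range(n_words):
        t = rng.choice(T, p=theta)                        # pick a topic
        words.append(rng.choice(phi.shape[1], p=phi[t]))  # pick a word for it

    links = []
    for _ in range(n_links):
        if rng.random() < rho:          # content-based followship
            t = rng.choice(T, p=theta)
            links.append(rng.choice(psi.shape[1], p=psi[t]))
        else:                           # content-independent (popularity)
            links.append(rng.choice(len(chi), p=chi))
    return words, links
```

Running it with toy distributions produces a bag of word ids and a list of followee ids for one simulated user.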
  11. Now that we have the topic-specific influence model, let's look at how to learn it from the observed data.
  12. FLDA specifies a probabilistic procedure, with latent variables, that generates the content and links of each user in a microblog network. Given the observed text and links, we want to infer the distributions of those latent variables. We use Gibbs sampling for the inference, because it is the most widely used approach for approximating the distributions of latent variables, especially for high-dimensional data. Gibbs sampling is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of samples that approximate a multivariate probability distribution. The process starts with some initial value for each variable, then iteratively samples each variable conditioned on the current values of the remaining variables, updating the variable with its new value as soon as it has been sampled. This repeats for hundreds of iterations, and at the end the produced samples are used to approximate the distributions of the latent variables. Observed data is incorporated by fixing the corresponding variables to their observed values rather than sampling them, so the distribution of the remaining variables is effectively a posterior conditioned on the observed data. A collapsed Gibbs sampler additionally integrates out (marginalizes over) one or more variables when sampling the others. The key, and most challenging, part of Gibbs sampling is deriving the conditional distribution of each latent variable. Here are the conditional probabilities we derived for the FLDA model. They look very complicated, but essentially what we derived are …
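To make the iterate-and-condition pattern concrete, here is a toy Gibbs sampler for a target far simpler than FLDA: a bivariate normal with correlation `rho`, whose conditionals are known in closed form. The update structure (sample each variable given the current value of the other) is the same one FLDA's sampler uses.

```python
import math
import random

def gibbs_bivariate_normal(n_iter, rho, seed=0):
    """Toy Gibbs sampler for a standard bivariate normal with
    correlation rho. Each step samples one variable conditioned on
    the current value of the other, exactly the Gibbs update pattern
    described in the talk, just with a much simpler target."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0                      # arbitrary initial values
    sd = math.sqrt(1.0 - rho * rho)      # conditional std deviation
    samples = []
    for _ in range(n_iter):
        x = rng.gauss(rho * y, sd)       # x | y ~ N(rho*y, 1 - rho^2)
        y = rng.gauss(rho * x, sd)       # y | x ~ N(rho*x, 1 - rho^2)
        samples.append((x, y))
    return samples
```

After discarding a burn-in prefix, the empirical correlation of the samples approaches `rho`, which is the sense in which Gibbs samples "approximate the distribution."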
  13. Now I will give an overview of the Gibbs sampling algorithm for FLDA. After all the latent variables are initialized, the algorithm executes in iterations, making one pass over the data per iteration. During each pass, for each user, we first sample over all the words: for the n-th word of the m-th user, we have the observed value of the word and sample a new topic assignment t'. After all the words, we sample over the followees: for each followee, we first sample a new followship preference, and if the preference is content-based, we then sample a topic for the link. During the sampling process, we maintain a number of counters, because they appear in the conditional distributions used for sampling. After the sampling process finishes, we again use the counters to estimate the posterior distributions of the latent variables, such as the per-user topic distributions.
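The counter-based bookkeeping in that pass can be sketched as follows. This is a simplified collapsed-Gibbs update for the word half only (standard LDA-style), under my own counter names; the link sampling in FLDA follows the same decrement/sample/increment pattern with its own counters, and the actual FLDA conditionals are more involved than the LDA-style formula used here.

```python
import numpy as np

def gibbs_pass(rng, docs, z, n_wt, n_dt, n_t, alpha, beta):
    """One pass of collapsed Gibbs sampling over word-topic assignments.

    Counter arrays (the "counters" mentioned in the talk):
      n_wt[w, t] -- # times word w is assigned to topic t (all users)
      n_dt[d, t] -- # times a word of user d is assigned to topic t
      n_t[t]     -- total # words assigned to topic t
    """
    T = n_t.shape[0]
    V = n_wt.shape[0]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t_old = z[d][i]
            # remove the current assignment from the counters
            n_wt[w, t_old] -= 1; n_dt[d, t_old] -= 1; n_t[t_old] -= 1
            # conditional distribution over topics, built from counters
            p = (n_wt[w] + beta) / (n_t + V * beta) * (n_dt[d] + alpha)
            t_new = rng.choice(T, p=p / p.sum())
            # record the new assignment
            n_wt[w, t_new] += 1; n_dt[d, t_new] += 1; n_t[t_new] += 1
            z[d][i] = t_new
```

After enough passes, the counters themselves yield the posterior estimates, e.g. the per-user topic distribution is roughly `(n_dt[d] + alpha)` normalized.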
  14. The Gibbs sampling process described so far is inherently sequential: each sampling step relies on the most recent values of all the other variables. The sequential algorithm does not scale. For example, it would run for 21 days on a high-end server (192 GB RAM, 3.2 GHz processor) for a Twitter dataset with 1.8M users, 2.4B words, and 183M links. So we definitely need to parallelize the computation. How can we parallelize an inherently sequential process? We notice that for our problem there is a huge number of words and a large number of links, so the dependency between different variables is relatively weak. We therefore relax the sequential constraint and propose a distributed Gibbs sampling method. We implemented our distributed algorithm on top of Spark, a distributed processing framework for iterative machine learning workloads developed by the Berkeley AMPLab.
  15. Here is an overview of how it works. In a Spark cluster, there is a master and a number of workers. We partition the set of users across the workers. For each user, we hold the last topic assignment of each word, and the last preference assignment and last topic assignment of each link. Each user also holds user-local counters, such as the number of times a word is assigned to a topic for this user. The master is responsible for the global counters, e.g. the number of times a word is assigned to a topic across all users. The algorithm executes in a number of iterations. At the beginning of each iteration, the master broadcasts the global counters to all workers. Each worker then uses the global counters to sample and update the data of all its users. As the process goes on, the global counters become out of date, so at the end of the iteration the workers send their new local counters to the master, and the master uses them to update the global counters. Then the next iteration begins.
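The broadcast/sample/merge cycle can be sketched as a single-process simulation. This only illustrates the synchronization pattern (stale global snapshot per iteration, deltas merged at the end); in the real system each worker runs as a Spark task, and `sample_fn` is a hypothetical stand-in for a worker's per-partition sampling.

```python
import numpy as np

def distributed_gibbs(global_counts, partitions, n_iters, sample_fn):
    """Sketch of the master/worker loop described above, simulated
    in one process.

    sample_fn(partition, snapshot) stands in for one worker's sampling
    over its users; it must return the *delta* its sampling produced
    on the counters (new local counts minus old).
    """
    for _ in range(n_iters):
        # master broadcasts a snapshot of the global counters
        snapshot = global_counts.copy()
        # each worker samples its users against the (possibly stale) snapshot
        deltas = [sample_fn(p, snapshot) for p in partitions]
        # master folds the workers' deltas back into the global counters
        for delta in deltas:
            global_counts += delta
    return global_counts
```

The key relaxation is that workers read a snapshot that grows stale within an iteration; the talk's experiments suggest the weak coupling between variables makes this tolerable.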
  16. The distributed algorithm looks pretty simple, but there are a number of issues that need special care. First, we need to ensure fault tolerance: Gibbs sampling requires hundreds of iterations, and we don't want to restart from scratch whenever a failure occurs. At the time we developed the distributed algorithm, Spark relied on lineage for fault tolerance, so if a worker failed, it had to restart from the very beginning. This is not desirable for an algorithm that runs hundreds of iterations, so we implemented a checkpointing mechanism in Spark. (We note that the latest release of Spark has since introduced its own checkpointing mechanism.) The second issue is the frequency of global synchronization. As mentioned before, we synchronize all the global counters every iteration. We also tried various other frequencies and found that even when synchronizing every 10 iterations, the quality of the results is not affected. A more subtle issue is the random number generator. Because workers are spawned at roughly the same time, if we simply used the Java random number generator, there would be correlations between the pseudo-random numbers generated across workers, which would jeopardize the quality of the returned results. To guarantee the correctness of the distributed Monte Carlo simulation, we need provably independent streams of uniform random numbers, so we used a long-period, jump-ahead random number generator. The last issue concerns efficient local computation: we took extra care to exploit the local memory hierarchy and avoid random memory access by sampling in a particular order.
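As an illustration of the jump-ahead idea (not the generator the paper used), NumPy's PCG64 bit generator exposes the same mechanism: `jumped(i)` advances the state by `i * 2**127` draws, so each worker gets a provably non-overlapping stream from one seed.

```python
import numpy as np

def worker_rngs(seed, n_workers):
    """Independent per-worker random streams via jump-ahead.

    Sketch only: the paper used a long-period, jump-ahead generator;
    NumPy's PCG64.jumped() realizes the same idea, jumping each
    worker's stream 2**127 steps past the previous one so the streams
    cannot overlap or correlate."""
    base = np.random.PCG64(seed)
    return [np.random.Generator(base.jumped(i)) for i in range(n_workers)]
```

Seeding every worker with `Random(time_ms)` instead would risk exactly the cross-worker correlation the talk warns about.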
  17. Finally, to put everything together, we incorporate the FLDA model into a search framework for topic-specific influencers, called SKIT. Using SKIT, a user simply enters a set of keywords to express his or her interest, and SKIT returns a list of key influencers that match the user's intent. SKIT is a general search framework: besides FLDA, it can plug in other key-influencer methods, such as Link-LDA, topic-sensitive PageRank, and TwitterRank. Here I will only focus on how it is implemented using FLDA. To support the search, SKIT first needs to derive the topics of interest from the keywords. In FLDA, we can simply treat the keywords as the content of a new user and use the "folding-in" approach to quickly compute the topic distribution of this new user; each value indicates the probability that the query is about a particular topic. From FLDA, we also have the per-topic influence score of each user. So, to compute the influence score of a user for the keyword query, we simply compute the weighted sum of the per-topic influence scores across all topics. At the end, we sort the users by their influence scores and return the top influencers.
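The final ranking step (weighted sum, then sort) is a one-line matrix-vector product. A minimal sketch, assuming the folding-in step has already produced the query's topic distribution:

```python
import numpy as np

def rank_influencers(query_topic_dist, topic_influence, k=10):
    """Sketch of SKIT's scoring and ranking step.

    query_topic_dist -- topic distribution of the keyword query
                        (length-T vector, from folding the keywords
                        into FLDA as a pseudo-user)
    topic_influence  -- T x U matrix; row t is the per-topic followee
                        (influence) distribution for topic t
    Returns the ids of the top-k users by weighted influence score.
    """
    # weighted sum of per-topic influence scores across all topics
    scores = query_topic_dist @ topic_influence
    return np.argsort(scores)[::-1][:k]
```

For example, a query that is 90% about topic 0 ranks users mostly by their topic-0 influence.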
  18. Now I will present some performance numbers.
  19. We first check whether the top influencers returned by our method make sense. Here we use a Twitter dataset crawled in 2010. It contains …. In this table, we show some example topics with their top keywords and top influencers produced by FLDA; we named the topics for better presentation. Intuitively, the influencers are clearly very relevant to the corresponding topics. For example, one would expect O'Reilly publishers, Gartner research, and popular software bloggers to be influential in an IT-related topic, and pro-cycling athletes, a pro-cycling team, and a team director to be influential in a topic related to cycling and running. FLDA separated the "globally" popular users from the content-specific influencers, and estimated that 15% of all links were content-independent; in other words, 15% of the time these popular users were followed regardless of what people tweet about. The top-5 globally popular users are singers, actors, and talk show hosts.
  20. Anecdotal evidence is hard to generalize and quantify. Luckily, the 2012 KDD Cup provided us with the data needed to objectively measure the quality of FLDA and other approaches. This Tencent Weibo dataset contains ….. A very nice feature of the Weibo dataset is the provided set of VIP users. These VIP users are organized into categories; one example category is …. The VIP users are manually labeled "key influencers" in their corresponding category, so we can use them as ground truth to evaluate result quality. Note that the categories do not necessarily align with the topics FLDA detects; they represent combinations of topics. If we use all the content of a VIP user as the search input, we can check how many of the top-k results are other VIP users in the same category and use that percentage as the precision. Overall, we compute the mean average precision over all the VIP users across all categories. This chart compares FLDA with existing approaches: Link-LDA, TSPR, and TwitterRank. As we can see, FLDA produces significantly better results than the existing approaches: over 2x better than TSPR and TwitterRank, and 1.6x better than Link-LDA. We also compared the results of the sequential algorithm against the distributed algorithm; they are comparable, which confirms our intuition.
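The evaluation metric can be sketched as follows. This is one standard formulation of average precision over a top-k list; the paper's exact variant may differ in details such as the normalization term.

```python
def average_precision(ranked, relevant, k):
    """Average precision for one query: here, one VIP user's content
    as input, with the other VIP users of the same category as the
    'ground truth' relevant set."""
    hits, precision_sum = 0, 0.0
    for i, u in enumerate(ranked[:k]):
        if u in relevant:
            hits += 1
            precision_sum += hits / (i + 1)   # precision at each hit
    return precision_sum / min(k, len(relevant)) if relevant else 0.0

def mean_average_precision(queries, k):
    """Mean of average precision over all (ranked, relevant) queries,
    i.e. over all VIP users across all categories."""
    aps = [average_precision(r, rel, k) for r, rel in queries]
    return sum(aps) / len(aps)
```

A perfect ranking, with every same-category VIP at the top, scores 1.0; pushing the relevant users down the list lowers the score.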
  21. Now that we know the sequential and distributed algorithms produce comparable results, we next compare their execution times. The sequential algorithm was run on a high-end server … and the distributed algorithm on a Spark cluster with 27 servers. Again, we ran on the Weibo and Twitter datasets. Note that Twitter is the larger dataset: although it has fewer users, it has significantly more words and links. This table shows the execution times of the sequential and distributed FLDA algorithms running 500 iterations. For the Weibo dataset, the distributed algorithm reduces the runtime from 4.6 days to 8 hours; for the Twitter dataset, it reduces the runtime from 21 days to 1.5 days, more than an order of magnitude faster.
  22. We now evaluate the scalability of the distributed algorithm along three dimensions: data size, number of topics, and number of concurrent workers. We explore a wide range of data sizes (from 12.5% all the way up to 100%), numbers of topics (from 25 to 200), and numbers of workers (from 25 to 200). The figure shows that distributed FLDA scales well along all dimensions.
  23. To summarize, we proposed a novel FLDA model for topic-specific influence analysis. The model combines content and link structure in the same generative process, and is able to differentiate the different reasons why one user follows another. To apply FLDA to a web-scale microblog network, we designed a distributed Gibbs sampling algorithm for it. Finally, the FLDA model is incorporated into a proposed general search framework for topic-specific key influencers. Through experiments on two real-world microblog datasets, we demonstrate that FLDA significantly outperforms state-of-the-art methods in terms of precision. Furthermore, the distributed Gibbs sampling algorithm for FLDA provides excellent speed-up to hundreds of workers.