Mining Social Data
     FOSDEM 2013
Credits
 Speaker    Romeu "@malk_zameth" MOURA
Company     @linagora
  License   CC-BY-SA 3.0
SlideShare   j.mp/XXgBAn
   Sources   ● Mining Graph Data
             ● Mining the Social Web
             ● Social Network Analysis for startups
             ● Social Media Mining and Social Network Analysis
             ● Graph Mining
I work at Linagora, a French FLOSS company.
EloData &
      OpenGraphMiner
Linagora's foray into ESN, DataStorage, Graphs &
                      Mining.
Why mine social data at
         all?
    Without being a creepy stalker
To see what humans
       can't.
  Influence, centers of interest.
To remember what humans
        can't.
What worked in the past? Objectively how did I behave
                     until now?
To discover what humans
         won't.
Serendipity
Find what you were not looking for
Real life social data
   What is so specific about it?
Always graphs
Dense substructures
  Every vertex is a unique entity (someone).
Several dense subgraphs: relations within pockets of
                     people
Usually it has no good cuts
  Even the best partition algorithms cannot find
        partitions that are just not there
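To make "no good cuts" concrete, here is a minimal sketch that scores a split by its conductance (fraction of edge volume crossing the cut, lower is better). It assumes networkx is available; the two toy graphs and the helper name `split_quality` are illustrative and not from the talk.

```python
import networkx as nx

def split_quality(g, part):
    """Conductance of the cut separating `part` from the rest of the graph."""
    return nx.conductance(g, part)

# Toy graph with an obvious cut: two 5-cliques joined by a single edge.
two_groups = nx.barbell_graph(5, 0)
# Toy graph with no good cut: ten vertices, everyone linked to everyone.
one_blob = nx.complete_graph(10)

half = set(range(5))
print("two groups:", split_quality(two_groups, half))
print("one blob:  ", split_quality(one_blob, half))
```

The first score comes out close to zero and the second does not: in the dense graph every balanced split severs a large share of the edges, whichever algorithm proposes it.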
There will be
errors & unknowns
 Exact matching is not an option
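One way to read this in practice: the same person rarely appears under a byte-identical name in two sources, so comparisons need a tolerance. The sketch below uses the standard library's `difflib`; the helper names and the 0.85 cutoff are assumptions chosen for illustration, not values from the talk.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case and outer spaces."""
    norm_a, norm_b = a.strip().lower(), b.strip().lower()
    return SequenceMatcher(None, norm_a, norm_b).ratio()

def probably_same(a, b, cutoff=0.85):
    """True when two names are close enough to be treated as one entity.

    The cutoff is an arbitrary illustration value and needs tuning on real data.
    """
    return similarity(a, b) >= cutoff

print(probably_same("Guy Kawasaki", " guy kawasaki"))
print(probably_same("Guy Kawasaki", "Guy Kawasaky"))
print(probably_same("Guy Kawasaki", "Justin Bieber"))
```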
Plenty of vanity metrics
       pollution.
    Sometimes very surprising ones.
Number of followers is a
    vanity metric
@GuyKawasaki (~1.5M followers) is retweeted far more
 than the user with the most followers
           (@justinbieber, ~34M)
Why use graphs?
What is the itch with Inductive Logic that Inductive
                  Graphs scratch?
'Classic' Data Mining
       Pros and cons
pro: Solid, well-known
   techniques
   with good performance
con: Complex structures
     are translated
 into Bayesian Networks or multi-relational tables,
incurring either data loss or combinatorial explosion.
Graph Mining
  'The new deal'
pro: Expressiveness
         and simplicity
The input and output are graphs, no conversions, graph
                algorithms all around.
con: The unit of operation
       is testing
   subgraph isomorphism
         NP-Complete
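The comparison in question can be tried out with networkx's VF2-based matcher. This is only a sketch on toy graphs: `contains_pattern` is a hypothetical helper name, and `subgraph_is_isomorphic` tests for a node-induced subgraph, so it is stricter than "the pattern's edges appear somewhere".

```python
import networkx as nx
from networkx.algorithms import isomorphism

def contains_pattern(graph, pattern):
    """True if some node-induced subgraph of `graph` is isomorphic to `pattern`."""
    matcher = isomorphism.GraphMatcher(graph, pattern)
    return matcher.subgraph_is_isomorphic()

triangle = nx.cycle_graph(3)

print(contains_pattern(nx.complete_graph(4), triangle))  # K4 is full of triangles
print(contains_pattern(nx.cycle_graph(4), triangle))     # a square has none
```

The search is fast on small inputs like these, but in the worst case its running time grows exponentially with graph size, which is the cost the slide refers to.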
Extraction
 Getting the data
Is the easy part
  A commodity really.
Social networks provide
          APIs
Facebook Graph API, Twitter REST API, Yammer API,
                      etc.
Worst case:
Crawl the website
Crawling The Web For Fun And Profit:
   http://youtu.be/eQtxbaw__W8
import sys
import twitter
import networkx as nx
from recipe__get_rt_origins import get_rt_origins

def create_rt_graph(tweets):
    """Build a directed graph with one edge per (retweeted user, retweeter)."""
    g = nx.DiGraph()
    for tweet in tweets:
        rt_origins = get_rt_origins(tweet)
        if not rt_origins:
            continue
        for rt_origin in rt_origins:
            # Edge direction: original author -> user who retweeted.
            g.add_edge(rt_origin, tweet['from_user'], tweet_id=tweet['id'])
    return g

if __name__ == '__main__':
    # Join all command-line arguments into a single search query.
    Q = ' '.join(sys.argv[1:])
    MAX_PAGES = 15
    RESULTS_PER_PAGE = 100
    # Legacy Twitter Search API endpoint, as on the original slide.
    twitter_search = twitter.Twitter(domain='search.twitter.com')
    search_results = []
    for page in range(1, MAX_PAGES + 1):
        search_results.append(
            twitter_search.search(q=Q, rpp=RESULTS_PER_PAGE, page=page)
        )
    all_tweets = [tweet for page in search_results for tweet in page['results']]
    g = create_rt_graph(all_tweets)

    print("Number nodes:", g.number_of_nodes(), file=sys.stderr)
    print("Num edges:", g.number_of_edges(), file=sys.stderr)
    print("Num connected components:",
          nx.number_connected_components(g.to_undirected()), file=sys.stderr)
    print("Node degrees:", sorted(d for _, d in g.degree()), file=sys.stderr)
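With a retweet graph of this shape (edges pointing from the original author to the retweeter), influence can be read from the graph rather than from follower counts. A minimal sketch on a hand-made graph follows; the user names are invented and `most_retweeted` is a hypothetical helper.

```python
import networkx as nx

def most_retweeted(g, n=3):
    """Rank users by how many distinct users retweeted them (out-degree)."""
    ranked = sorted(g.out_degree(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Invented data: an edge (a, b) means "b retweeted a".
g = nx.DiGraph()
g.add_edges_from([
    ("alice", "bob"), ("alice", "carol"), ("alice", "dave"),
    ("bob", "carol"),
])

print(most_retweeted(g, 2))
```

Out-degree counts distinct retweeters only; weighting edges by the number of retweets would be a natural refinement.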
Finding patterns
  substructures that repeat
Older options
Apriori-based, Pattern growth
Stepwise pair expansion
Separate the graph into pairs, count their frequencies, keep
   the most frequent, augment them by one, repeat.
"Chunk": Separate the graph by pairs
Keep only the frequent ones
Expand them
Find your frequent pattern
con: Chunkiness
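A single chunking pass of the stepwise pair expansion above could look like the sketch below. It assumes an undirected networkx graph whose vertices carry a string `label` attribute (the slides do not specify a data model), and the helper names are hypothetical. Occurrences are merged greedily, so overlapping occurrences of the pair are lost, which is one way to picture the "chunkiness" drawback.

```python
from collections import Counter
import networkx as nx

def label_pair(g, u, v):
    """Unordered pair of the labels carried by the endpoints of an edge."""
    return tuple(sorted((g.nodes[u]["label"], g.nodes[v]["label"])))

def count_label_pairs(g):
    """Frequency of every label pair over all edges."""
    return Counter(label_pair(g, u, v) for u, v in g.edges())

def chunk_most_frequent_pair(g):
    """Merge each non-overlapping occurrence of the top pair into one vertex."""
    counts = count_label_pairs(g)
    if not counts:
        return g.copy(), None
    best, _ = counts.most_common(1)[0]
    h = g.copy()
    used = set()
    for u, v in list(g.edges()):
        if u in used or v in used or label_pair(g, u, v) != best:
            continue
        used.update((u, v))
        chunk = (u, v)  # the merged vertex is named after its two members
        outside = (set(h[u]) | set(h[v])) - {u, v}
        h.add_node(chunk, label="-".join(best))
        h.add_edges_from((chunk, w) for w in outside)
        h.remove_nodes_from((u, v))
    return h, best

g = nx.Graph()
g.add_nodes_from([(1, {"label": "A"}), (2, {"label": "B"}), (3, {"label": "A"}),
                  (4, {"label": "B"}), (5, {"label": "C"})])
g.add_edges_from([(1, 2), (3, 4), (2, 5)])
chunked, pair = chunk_most_frequent_pair(g)
print(pair, sorted(d["label"] for _, d in chunked.nodes(data=True)))
```

Calling the function again on `chunked` performs the next expansion step.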
"ChunkingLess"
Graph Based Induction
     CL-GBI [Cook et al.]
Inputs needed
1. Minimal frequency at which we consider a
   substructure to be a pattern: threshold
2. Number of most frequent patterns we will
   retain: beam size
3. Number of times we will iterate (chosen arbitrarily):
   levels
1. "Chunk": Separate the graph by
pairs
2. Select beam-size most frequent
ones
3. Turn selected pairs into pseudo-
nodes
4. Expand & Rechunk
Keep going back to step 2
    Until you have done it levels times.
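The loop above can be sketched as a beam search over pattern instances, leaving the input graph untouched. This is a simplified illustration, not the published algorithm: an instance is a set of vertices, a pseudo-node is only ever paired with a single adjacent vertex, and patterns are identified by a Weisfeiler-Lehman hash of the induced subgraph (an approximate canonical form, available in networkx 2.5 and later). Vertices are assumed to carry a `label` attribute, and all function names are hypothetical.

```python
from collections import defaultdict
import networkx as nx

def pattern_key(g, vertices):
    """Approximate canonical form of the subgraph induced by `vertices`."""
    sub = g.subgraph(vertices)
    return nx.weisfeiler_lehman_graph_hash(sub, node_attr="label")

def group_instances(g, instances):
    """Group vertex sets by pattern: key -> set of instances."""
    groups = defaultdict(set)
    for inst in instances:
        groups[pattern_key(g, inst)].add(inst)
    return groups

def beam_pattern_search(g, threshold, beam_size, levels):
    """Return {pattern key: instances} for patterns seen >= threshold times."""
    # Step 1: every edge is a candidate pair.
    candidates = {frozenset(e) for e in g.edges()}
    frequent = {}
    for _ in range(levels):
        groups = group_instances(g, candidates)
        # Step 2: keep only the beam_size most frequent patterns.
        ranked = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
        selected = ranked[:beam_size]
        for key, insts in selected:
            if len(insts) >= threshold:
                frequent[key] = insts
        # Steps 3-4: treat each selected instance as a pseudo-node and pair
        # it with every adjacent vertex that lies outside the instance.
        candidates = set()
        for _, insts in selected:
            for inst in insts:
                for v in inst:
                    for w in g[v]:
                        if w not in inst:
                            candidates.add(inst | {w})
        if not candidates:
            break
    return frequent

# Two disjoint A-B-C chains: the pairs and the whole chain each occur twice.
g = nx.Graph()
for base in (0, 10):
    g.add_nodes_from([(base + 1, {"label": "A"}), (base + 2, {"label": "B"}),
                      (base + 3, {"label": "C"})])
    g.add_edges_from([(base + 1, base + 2), (base + 2, base + 3)])

found = beam_pattern_search(g, threshold=2, beam_size=2, levels=2)
print(sorted(len(next(iter(insts))) for insts in found.values()))
```

Because selected pairs stay available as overlapping instances instead of being rewritten into the graph, patterns that a greedy chunking pass would hide can still be reached.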
Decision Trees
A Tree of patterns
Finding a pattern on a branch yields a decision
DT-CLGBI
DT-CLGBI(graph: D)
begin
  create_node DT in D
  if threshold-attained
    return DT
  else
    P <- select_most_discriminative(CL-GBI(D))
    (Dy, Dn) <- branch_DT_on_predicate(P)
    for Di <- Dy
      DT.branch_yes.add-child(DT-CLGBI(Di))
    for Di <- Dn
      DT.branch_no.add-child(DT-CLGBI(Di))
    return DT
end
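The recursion in the DT-CLGBI pseudocode can be sketched in Python as follows. Several pieces are assumptions made for illustration: examples are `(graph, class_label)` pairs, the pattern miner is passed in as a plain function `find_patterns`, "most discriminative" means lowest weighted entropy after the split, and growth stops when a node is pure or no pattern separates it.

```python
from collections import Counter
import math
import networkx as nx
from networkx.algorithms import isomorphism

def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a node-induced subgraph."""
    return isomorphism.GraphMatcher(graph, pattern).subgraph_is_isomorphic()

def entropy(examples):
    """Shannon entropy of the class labels in `examples`."""
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def build_tree(examples, find_patterns):
    """Recursively split the examples on the most discriminative pattern."""
    labels = Counter(label for _, label in examples)
    majority = labels.most_common(1)[0][0]
    if len(labels) == 1:  # stopping rule assumed here: the node is pure
        return {"leaf": majority}
    best = None
    for pattern in find_patterns(examples):
        yes = [ex for ex in examples if contains(ex[0], pattern)]
        no = [ex for ex in examples if not contains(ex[0], pattern)]
        if not yes or not no:
            continue  # this pattern does not separate anything
        score = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(examples)
        if best is None or score < best[0]:
            best = (score, pattern, yes, no)
    if best is None:
        return {"leaf": majority}
    _, pattern, yes, no = best
    return {"pattern": pattern,
            "yes": build_tree(yes, find_patterns),
            "no": build_tree(no, find_patterns)}

def classify(tree, graph):
    """Follow yes/no branches until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["yes"] if contains(graph, tree["pattern"]) else tree["no"]
    return tree["leaf"]

# Toy data: graphs containing a triangle are "dense", the others "sparse".
examples = [(nx.complete_graph(4), "dense"), (nx.complete_graph(3), "dense"),
            (nx.path_graph(4), "sparse"), (nx.cycle_graph(5), "sparse")]
tree = build_tree(examples, lambda exs: [nx.cycle_graph(3)])
print(classify(tree, nx.complete_graph(5)), classify(tree, nx.path_graph(6)))
```

In a fuller version, `find_patterns` would run the pattern search on the graphs reaching each node, so that every level of the tree mines its own candidates.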
