Blogosphere: Research Issues, Tools, and Applications


               Franco Sánchez Huertas
                       (UCSP)



                EDA – June, 2010
21/06/2010           UCSP -FASH          1
Overview

•   Background: Web 2.0 and Social Networks
•   Blogosphere: Definition, Types, and Comparison
•   Blogosphere Research Issues
•   Tools and APIs
•   Data Collection
•   Searching the Influentials: The Top Bloggers
•   Conclusions




Web 2.0 and Social Networks




Characteristics of Web 2.0

• Rich Internet Applications
• User-generated content
• User-enriched content
• User-developed widgets
• Collaborative environment: Participatory Web, Citizen
  journalism
• Thus, it leverages the power of the Long Tail with user
  generated data as the driving force
• More of a paradigm shift than a technology shift



Technology Overview of Web 2.0

• Cascading Style Sheets to aid in the separation of presentation and
  content
• Folksonomies (collaborative tagging, social classification, social
  indexing, and social tagging)
• REST and/or XML- and/or JSON-based APIs
• Rich Internet application techniques, often based on Ajax, Flex, or Flash
• Semantically valid XHTML and HTML markup
• Syndication, aggregation and notification of data in RSS or Atom
  feeds
• Mashups, merging content from different sources, client- and
  server-side
• Weblog-publishing tools
• Wiki or forum software to support user-generated content

Some Web 2.0 Services
•   Blogs
      – Blogspot
      – WordPress
      – Lamula (Perú)
•   Wikis
      – Wikipedia
      – Wikiversity
•   Social Networking Sites
      – Facebook
      – Twitter
      – MySpace
      – Orkut
•   Digital media sharing websites
      – YouTube
      – Flickr
      – Vimeo
      – Twitpic
•   Social Tagging
      – Del.icio.us



Social Networks



• A social structure made of nodes (individuals or
  organizations) that are related to each other by various
  interdependencies such as friendship, kinship, likes, ...

• Graphical representation
      – Nodes = members
      – Edges = relationships




• Various realizations
      – Social bookmarking (Del.icio.us)
      – Friendship networks (Facebook, MySpace)
      – Blogosphere
      – Media sharing (Flickr, YouTube)
      – Folksonomies

BLOGOSPHERE




              Definitions, Types, and Comparison


Blogging Phenomenon

• It’s growing fast as a new means for online
  communications and interactions
• A blogger could gain instant fame via his blogs
• A blogger may make a good living with her
  blogs
• Abundant, lucrative business opportunities
• A new political arena


Blog Structure

[Figures: three slides labeling the parts of a blog in turn: Blog Site, Blog Post, Blogger]

Types of Blogs

 • Individual vs. community
       – Single authored (Individual blog sites)
       – Multi authored (Community blog sites)
       Individual Blog Sites                                Community Blog Sites
       Owned and maintained by individual users.            Owned and maintained by a group of like-minded users.
       More like personal accounts, journals or diaries.    More like discussion forums and discussion boards.
       No or almost negligible group interaction.           High degree of group discussion and collaboration.
       No or almost negligible collective wisdom.           Enormous collective wisdom and open source intelligence.

 • Regulated vs. anonymous

Blogosphere

• Complex Social Networks
• Vertices (Nodes): Bloggers/
  Blog posts/Blog sites
• Edges: Relationships/Links
• In-Degree: Number of
  inlinks
• Out-Degree: Number of
  outlinks
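
As a minimal sketch of the in-degree and out-degree definitions above, the following counts inlinks and outlinks from a list of directed links. The function name `degree_counts` and the blog names are hypothetical and chosen only for illustration.

```python
from collections import Counter

def degree_counts(links):
    """Return (in_degree, out_degree) Counters for (source, target) links."""
    in_degree = Counter()
    out_degree = Counter()
    for source, target in links:
        out_degree[source] += 1   # one more outlink for the linking node
        in_degree[target] += 1    # one more inlink for the linked node
    return in_degree, out_degree

# Hypothetical blog sites and the hyperlinks between them.
links = [("blogA", "blogB"), ("blogC", "blogB"), ("blogB", "blogA")]
in_degree, out_degree = degree_counts(links)
print(in_degree["blogB"], out_degree["blogB"])
```

The same counting applies whether the nodes are bloggers, blog posts, or blog sites; only the meaning of a link changes.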




Friendship Networks vs. Blogosphere
       Friendship Networks                                   Blogosphere
       Explicit links/edges                                  Implicit links/edges
       Undirected graph                                      Directed graph
       Network centrality measures                           Blog statistics
       Quantifying spread of influence                       Quantifying influential members
       Nodes are members/actors                              Nodes can be bloggers/blogs or blog sites
       Strictly defined graph structure                      Loosely defined graph structure
       “Being in touch” or “Making Friends”                  Sharing ideas and opinions
       Person-to-person                                      Person-to-group
       Friendship oriented                                   Community oriented
       Member’s reputation/trust based on network            Member’s reputation/trust based on the response
       connections and/or location in the network            to other members’ knowledge solicitations




[Figure: Social Networks diagram relating Social Friendship Networks and Blogosphere. Examples listed, in order: Orkut, Facebook, LinkedIn, Classmates.com, etc.; LiveJournal, MySpace, etc.; TUAW, Blogger, Windows Live Spaces, etc.]

BLOGOSPHERE RESEARCH ISSUES




Understanding Blogosphere

• Blogosphere
• Blog sites
• Bloggers
• Blog posts
• Reverse chronologically ordered entries
• Blogroll
• Permalinks

• Everyone can publish, but few are heard
• Many interesting questions to address
      – How to build traffic
      – How to find niche online
      – How to increase influence
      – How to …
• Fertile research domain
Understanding Blogosphere

• Understand structures and properties of Blogosphere
• Gain insights into the relationships between
  bloggers, readers, blog posts, comments, different
  blog sites in Blogosphere
• Models help generate artificial data, tune the
  parameters to simulate special scenarios, and
  compare various studies and different algorithms
• Study peculiarities in Blogosphere and infer latent
  patterns and structures that could explain certain
  phenomena like influence, diffusion, splogs,
  community discovery.

Modeling Web and Blogosphere
• Some key differences between Web and Blogosphere
      – Models developed for the Web assume a dense graph structure due to a large
        number of interconnecting hyperlinks within webpages. This assumption does
        not hold for the Blogosphere, which has been shown to have a very sparse
        hyperlink structure [Kritikopoulos et al. 2006].
      – The level of interaction in terms of comments and replies to blog posts makes
        the Blogosphere different from the Web
      – The highly dynamic and “short-lived” nature of blog posts cannot be
        simulated by Web models, which do not consider dynamicity in
        webpages
      – Web models assume webpages accumulate links over time. However, this is
        not true of the Blogosphere
      – “Categories” and “tags” give blogs flexibility that conventional websites
        typically don’t have
      – Permalinks of blogs use descriptive filenames, unlike typical webpage
        filenames

Modeling Blogosphere
•   Preferential attachment: $P(e : v_i \to v_j) \propto \deg(v_i) / V$
      – The probability that a new edge attaches to a node depends on that node's degree
      – “The rich get richer”
      – Power law distribution or scale free distribution
•   Hybrid model: $P(e : v_i \to v_j) \propto \alpha \, \deg(v_i) / V + (1 - \alpha)$
      – Mixture of both preferential attachment model and random model
      – Give a lucky poor guy some chance to get rich
      – To solve irreducibility (strong connectedness with few isolated subgraphs), the
        random-walk-on-a-graph model proposes a random jump with a fixed probability
•   Leskovec et al. 2007 studied temporal patterns
      – How often people create blog posts
      – Burstiness and popularity
      – How these posts are linked and what the link density is
      – Developed an SIS-based model
•   Kumar et al. 2003 use blogrolls on the blog posts to construct a network of blog
    posts, assuming that blogrolls contain similar blog posts
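
The mixture of preferential attachment and the random model can be sketched as a small simulation. This is only an illustration under stated assumptions: each new node adds exactly one edge, the preferential choice is normalized by the sum of all degrees, and the names `pick_target`, `grow`, and `alpha` are hypothetical rather than taken from the cited models.

```python
import random

def pick_target(degrees, alpha, rng):
    """Pick the node that receives a new edge.

    With probability alpha, choose proportionally to degree (preferential
    attachment); otherwise choose uniformly at random (random model).
    """
    nodes = list(degrees)
    total = sum(degrees.values())
    if total > 0 and rng.random() < alpha:
        weights = [degrees[n] for n in nodes]
        return rng.choices(nodes, weights=weights, k=1)[0]
    return rng.choice(nodes)

def grow(num_nodes, alpha, seed=0):
    """Grow a graph one node at a time; each new node adds one edge."""
    rng = random.Random(seed)
    degrees = {0: 0}
    edges = []
    for new in range(1, num_nodes):
        target = pick_target(degrees, alpha, rng)
        edges.append((new, target))
        degrees[target] += 1
        degrees[new] = 1
    return degrees, edges

degrees, edges = grow(num_nodes=200, alpha=0.8)
print(max(degrees.values()), len(edges))
```

Setting `alpha=1.0` gives pure preferential attachment, while `alpha=0.0` gives the purely random model, so the parameter controls how much chance a low-degree node gets.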

Blog Clustering

• Dynamic and automatic organization of the content
• Convenient accessibility
• Optimizing search engines by reducing search space
      – Search only the relevant cluster
•   Focused crawling
•   Summarization
•   Topic identification
•   Reduce information overload
      – 175,000 blog posts per day, i.e., about 2 blog posts per second (Dec 2006)
• Extraction and analysis of the trends

Blog Clustering

• Brooks and Montanez 2006 used tf-idf and
  picked top 3 keywords for blog posts
      – $\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$
      – $\mathrm{tf}_{i,j} = \dfrac{n_{i,j}}{\sum_k n_{k,j}}$
      – $\mathrm{idf}_i = \log \dfrac{|D|}{|\{d_j : t_i \in d_j\}|}$
      – Clustered blogs based on these keywords
      – Reported improved clustering as compared to that using tags
• Li et al. 2007 assigned different weights to title, body,
  and comments of blog posts
      – Need to address high dimensionality and sparsity due to their
        keyword-based approach
• Agarwal et al. 2008 proposed a collective-wisdom
  based approach
      – Generate a category relation graph based on user assignments
      – Compute similarity matrix from this graph
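
The keyword-selection step (tf-idf, then the top 3 terms per post) can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: posts are assumed to be pre-tokenized, ties are broken alphabetically, and the function name `top_keywords` and the sample posts are hypothetical.

```python
import math
from collections import Counter

def top_keywords(posts, k=3):
    """Return the k highest tf-idf terms for each post (a list of token lists)."""
    num_posts = len(posts)
    doc_freq = Counter()
    for tokens in posts:
        doc_freq.update(set(tokens))  # count each term once per post
    result = []
    for tokens in posts:
        counts = Counter(tokens)
        total = sum(counts.values())
        scores = {
            term: (n / total) * math.log(num_posts / doc_freq[term])
            for term, n in counts.items()
        }
        # Highest score first; alphabetical order breaks ties.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:k])
    return result

posts = [
    "python code python tips".split(),
    "travel tips for peru".split(),
    "peru food and travel".split(),
]
print(top_keywords(posts))
```

Posts could then be grouped by shared keywords, which keeps the representation far smaller than the full vocabulary.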


Blog Mining
• Interactions between producers and consumers improved with blogs
• Consumers not only speak their mind but also broadcast their opinions
• Blogs are invaluable information sources
      –      consumers’ beliefs and opinions,
      –      initial reaction to a launch,
      –      understand consumer language,
      –      track trends and buzzwords, and
      –      fine-tune information needs
• Blog conversations leave behind the trails of links, useful for
  understanding how information flows and how opinions are shaped
  and influenced
• Tracking blogs also helps in gaining deeper insights



Blog Influence

• Two types of influence
      – Influential blog sites and site networks [Gill 2004, Gruhl et al 2004, Java et al
        2006]
      – Influential bloggers in a community [Agarwal et al. 2008]
• Blogosphere vs. Friendship Networks
      –      Implicit vs. Explicit links
      –      Blog statistics vs. Centrality measures
      –      “influencing” vs. “could influence”
      –      Loosely vs. Strictly defined graph structures
• Blog vs. Webpage Ranking
      – Blog sites too sparse for webpage ranking algorithms to work [Kritikopoulos et
        al 2006]
      – Webpage acquires authority over time, blog posts’ influence diminishes
      – Greedy approach works better than PageRank, HITS to maximize influence
        flow [Kempe et al 2003, Richardson & Domingos 2002]

Issue of Trust

 • Open standards and low barriers to publishing have created
   an overwhelming amount of collective wisdom
 • Yet in some cases it is more difficult for readers to discern
   whom to trust
 • Similar to WWW
       – Authoritative webpages, e.g., HITS [Kleinberg et al. 1998], PageRank
         [Page et al. 1999]
 • The Blogosphere allows the masses to create and edit content,
   compromising the sanctity of the original content
 • Some work exists for the social friendship network domain, but
   not many researchers have explored the Blogosphere
 • Huge potential for trust study in the Blogosphere domain


Trust
 • Kale et al. 2007 transformed the problem of trust in
   blogosphere to the one in social friendship networks
       – Studied propagation of trust among different blog sites
       – Mined sentiments from a window of words around hyperlinks
       – Identified positive, negative, or neutral sentiments towards the linked
         blog site
       – Constructed a network of blog sites using hyperlinks
       – Used the Guha et al. 2004 trust propagation algorithm
       – Some concerns
             • These blog sites have to be linked for trust propagation
             • Trust is computed between blog sites based on how much one blog
               agrees or disagrees with the other
       – Propagation, performed until convergence: $M_{i+1} = M_i \cdot C_i$
             • $M$ = belief matrix; $C_i$ = atomic propagation
             • $C_i = M + M^{T} M + M^{T} + M M^{T}$
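
A rough sketch of the propagation step on a tiny belief matrix follows. Several choices here are assumptions rather than part of the cited work: the atomic propagation matrix is built once from the initial beliefs, its four terms are weighted equally, a fixed number of steps replaces the convergence test, and each result is rescaled so that values stay bounded. All function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*a)]

def atomic_propagation(m):
    """C = M + M^T M + M^T + M M^T, with all four terms weighted equally."""
    mt = transpose(m)
    terms = [m, matmul(mt, m), mt, matmul(m, mt)]
    n = len(m)
    return [[sum(t[i][j] for t in terms) for j in range(n)] for i in range(n)]

def propagate(m, steps):
    """Apply M_{i+1} = M_i C for a fixed number of steps (assumed stopping rule)."""
    c = atomic_propagation(m)
    current = [row[:] for row in m]
    for _ in range(steps):
        current = matmul(current, c)
        # Assumed rescaling: divide by the largest absolute entry.
        peak = max(abs(x) for row in current for x in row)
        if peak > 0:
            current = [[x / peak for x in row] for row in current]
    return current

# Hypothetical beliefs: site 0 trusts site 1, and site 1 trusts site 2.
beliefs = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
trust = propagate(beliefs, steps=2)
print(trust)
```

After propagation, site 0 holds a nonzero belief about site 2 even though it never linked to it directly, which is the point of propagating trust along links.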
Community Extraction

• Blogosphere doesn’t have an explicit notion of communities
• Different from blog clustering
• Researchers identify communities based on
      – Links: network of hyperlinks allows identification of virtual communities
          • Several studies on finding community of webpages like Kleinberg 1998
             and Kumar et al. 1999
          • While Kleinberg used authority and hubs idea to explore communities of
             webpages, Kumar et al. extended the idea of hubs and authorities and
             included co-citations as a way to extract all communities on the web and
             used graph theoretic algorithms to identify all instances of graph
             structures that reflect community characteristics.
      – Content: blogs with similar content or inspired by the same event form a
        virtual community
          • Kumar et al. 2003, Efimova and Hendrick 2005, Blanchard 2004


Community Extraction

• Chin and Chignell 2006 proposed a model for finding
  communities taking the blogging behavior of bloggers into
  account
      – They combined behavioral approaches with a blog reader survey
        to study blog communities.
• Blanchard and Marcus 2004 studied a multiple sport
  newsgroup “Virtual Settlement” and analyzed the possibility
  of emerging virtual communities
      – Newsgroups and discussion forums are similar in terms of
        interaction patterns to Blogosphere
      – More person-to-group interaction rather than person-to-person
        interaction

Spam blog (Splogs) Filtering

•   One of the major rising concerns in the Blogosphere
•   Spammers make most of their money by getting viewers to click on ads that
    run adjacent to their nonsensical text
•   Open standards and low barriers to publishing escalate the problem and
    make it harder to solve
•   Besides degrading search quality, splogs also affect network resources




•   Initial research applied web spam link detection approaches
      – Ntoulas et al. 2006 distinguish between normal webpages and spam
        webpages based on statistical properties such as
             • number of words, average length of words, anchor text, title keyword frequency,
               tokenized URL
      – Gyongyi et al. 2004 and Gyongyi et al. 2006 use PageRank to compute the spam
        score of a webpage
•   Kolari et al. 2006 consider each blog post as a static webpage and use
    both content and hyperlinks to classify a blog post as spam using an
    SVM-based classifier
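
To make the statistical properties concrete, here is a small sketch that computes three of them (number of words, average word length, tokenized URL) for one post. The tokenization rules, the function name `content_features`, and the sample text and URL are assumptions for illustration; a real filter would feed such features to a trained classifier.

```python
import re

def content_features(text, url):
    """Compute a few simple statistical features for one page or blog post."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    num_words = len(words)
    avg_word_length = sum(len(w) for w in words) / num_words if num_words else 0.0
    # Split the URL on any run of non-alphanumeric characters.
    url_tokens = [t for t in re.split(r"[^A-Za-z0-9]+", url.lower()) if t]
    return {
        "num_words": num_words,
        "avg_word_length": avg_word_length,
        "url_tokens": url_tokens,
    }

features = content_features(
    "Cheap pills cheap deals", "http://example.com/cheap-pills-online"
)
print(features)
```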


Tools and APIs




                          Working in the Blogosphere…


Analysis and Visualization Tools

• Tools
      – Data Analysis & Visualization tools
      – Statistics like centrality measures
• NetLogo (http://ccl.northwestern.edu/netlogo/)
      – Multi-agent programming language and modeling environment
        designed in Logo
      – Modelers can give instructions to hundreds or thousands of
        concurrently operating autonomous agents.
      – Exploring the connection between the individuals (micro-level) and
        the patterns that emerge from the interaction of many individuals
        (macro-level).



Analysis and Visualization Tools
• UCINet (http://www.analytictech.com/)
      – Package for the analysis of social network data including centrality
        measures, subgroup identification, role analysis, elementary graph
        theory, and permutation-based statistical analysis
      – Has strong matrix analysis routines, such as matrix algebra and
        multivariate statistics
• Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek/)
      – Slovenian for spider
      – Analyzing and visualizing large networks like social networks
• Network package in R (http://cran.r-project.org/src/contrib/Descriptions/network.htm)
      – The network class can represent a range of relational data types, and
        support arbitrary vertex/edge/graph attributes
      – Used to create and modify network objects for social network
        analysis (SNA)


Analysis and Visualization Tools

• InFlow (http://www.orgnet.com/inflow3.html)
      – Integrated product for network analysis and visualization
      – Used in the SNA domain
• NetMiner (http://www.netminer.com/)
      – Tool for exploratory network data analysis and visualization
      – NetMiner lets users explore network data visually and
        interactively, and helps in detecting underlying patterns
        and structures of the network




APIs

• APIs
      – Data collection (blog posts, inlinks, tags, etc.)
      – Technorati
      – Digg
      – del.icio.us
      – Facebook
      – StumbleUpon



Technorati API

• bloginfo query
API url: http://api.technorati.com/bloginfo?key=[apikey]&url=[blog url]

Sample response:
             <result>
               <url>[URL]</url>
               <weblog>
                 <name>[blog name]</name>
                 <url>[blog URL]</url>
                 <rssurl>[blog RSS URL]</rssurl>
                 <atomurl>[blog Atom URL]</atomurl>
                 <inboundblogs>[inbound blogs]</inboundblogs>
                 <inboundlinks>[inbound links]</inboundlinks>
                 <lastupdate>[date blog last updated]</lastupdate>
                 <rank>[blog ranking]</rank>
                 <lang></lang>
                 <foafurl>[blog foaf URL]</foafurl>
               </weblog>
             </result>
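
A response of this shape could be read with Python's standard XML parser, as in the sketch below. The sample values are made up to stand in for the bracketed placeholders, the function name `parse_bloginfo` is hypothetical, and no request is sent to the service.

```python
import xml.etree.ElementTree as ET

# Hypothetical response with made-up values in place of the placeholders.
sample = """
<result>
  <url>http://example.com</url>
  <weblog>
    <name>Example Blog</name>
    <url>http://example.com</url>
    <inboundblogs>12</inboundblogs>
    <inboundlinks>34</inboundlinks>
    <rank>5678</rank>
  </weblog>
</result>
"""

def parse_bloginfo(xml_text):
    """Extract the name and link statistics from a bloginfo-style response."""
    weblog = ET.fromstring(xml_text.strip()).find("weblog")
    return {
        "name": weblog.findtext("name"),
        "inbound_blogs": int(weblog.findtext("inboundblogs")),
        "inbound_links": int(weblog.findtext("inboundlinks")),
        "rank": int(weblog.findtext("rank")),
    }

info = parse_bloginfo(sample)
print(info)
```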

Technorati API

• BlogPostTags query
API url: http://api.technorati.com/blogposttags?key=[apikey]&url=[blog url]

Sample response:

              <document>
                <result>
                  <querycount>[limit parameter]</querycount>
                </result>
                <item>
                  <tag>[tag name]</tag>
                  <posts>[tag count]</posts>
                </item>
              </document>




del.icio.us API

https://api.del.icio.us/v1/tags/get
Returns a list of tags and number of times used
Sample response

             <tags>
               <tag count="1"   tag="activedesktop" />
               <tag count="1"   tag="business" />
               <tag count="3"   tag="radio" />
               <tag count="5"   tag="xml" />
               <tag count="1"   tag="xp" />
               <tag count="1"   tag="xpi" />
             </tags>
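
As an illustration, a tag list in this format can be turned into a tag-to-count mapping with the standard library. The sketch parses a shortened copy of the sample above instead of calling the service, and the function name `parse_tag_counts` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample response shown above.
sample = """<tags>
  <tag count="1" tag="activedesktop" />
  <tag count="3" tag="radio" />
  <tag count="5" tag="xml" />
</tags>"""

def parse_tag_counts(xml_text):
    """Map each tag name to the number of times it was used."""
    root = ET.fromstring(xml_text)
    return {el.get("tag"): int(el.get("count")) for el in root.findall("tag")}

tag_counts = parse_tag_counts(sample)
most_used = max(tag_counts, key=tag_counts.get)
print(tag_counts, most_used)
```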




Data Collection




                               Using the Blogosphere…

Available Datasets

• TREC (http://ir.dcs.gla.ac.uk/test_collections/blog06info.html)
      – A crawl of Feeds, and associated Permalink and homepage
        documents (from late 2005 and early 2006)
      – 100,649 feeds were polled once a week for 11 weeks
      – Total number of feeds collected: 753,681
      – Average feeds collected every day: 10,615
      – Uncompressed size: 38.6 GB; compressed size: 8.0 GB
      – Reasonably sized spam component for added realism
      – Fee: £400 (about $794.36)


Available Datasets

• Mobile Network (http://kdl.cs.umass.edu/data/msn/msn-info.html)
      –      27 objects
      –      over 180,000 links
      –      1 object attribute
      –      2 link attributes
• Other ways
      –      Crawl blogs
      –      Blogcatalog
      –      Statistics available from technorati API
      –      Tagging available from del.icio.us API


Data Crawler

• BlogTrackers
      – User interface to crawl blog sites
             • Scratch crawling (from blog archives)
             • Incremental crawling (from RSS feeds)
      – Stores the blog posts in Microsoft SQL Server
      – Collects
             • Blog post title
             • Blog post content
             • Outlinks
             • Inlinks
             • Blog post tags
             • Blog post permalink
             • Blogger name
             • Blog post date and time
      – Tracks blog posts, e.g., generates tag clouds for a user-specified
        time window
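
The idea of incremental crawling from RSS feeds can be sketched as follows: parse a feed and keep only posts whose permalink has not been seen before. This is not BlogTrackers code; the function name `new_posts`, the RSS 2.0 element names it relies on, and the sample feed are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def new_posts(rss_text, seen_permalinks):
    """Return posts in an RSS 2.0 feed whose permalink has not been seen yet."""
    channel = ET.fromstring(rss_text).find("channel")
    posts = []
    for item in channel.findall("item"):
        link = item.findtext("link")
        if link in seen_permalinks:
            continue  # already collected in an earlier crawl
        posts.append({
            "title": item.findtext("title"),
            "permalink": link,
            "date": item.findtext("pubDate"),
            "tags": [c.text for c in item.findall("category")],
        })
        seen_permalinks.add(link)
    return posts

# Hypothetical feed with two posts, one of which was already collected.
feed = """<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item><title>First</title><link>http://example.com/1</link>
    <pubDate>Mon, 21 Jun 2010 10:00:00 GMT</pubDate><category>web</category></item>
  <item><title>Second</title><link>http://example.com/2</link>
    <pubDate>Tue, 22 Jun 2010 10:00:00 GMT</pubDate></item>
</channel></rss>"""

seen = {"http://example.com/1"}
fresh = new_posts(feed, seen)
print(fresh)
```

Calling the function again with the same `seen` set returns nothing new, which is what makes repeated polling of a feed cheap.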
Collectable Statistics from Blogs

•   Inbound links
      – Blogs, blog post, webpage
•   Outbound links
      – Blogs, blog post, webpage
•   Comments
•   Blog server logs
•   Subscribers
•   Time to read/length
•   Links to post and incoming traffic from them
•   Links from post and outgoing traffic to them
•   Topic frequency score
•   Blogroll links
•   Tagged urls (del.icio.us, furl)


Searching the Influentials: The Top Bloggers

• Active bloggers
      – Easy to define
      – Often listed at a blog site
      – Are they necessarily influential?
• How to define an influential blogger?
      – Influential bloggers have influential posts
      – Subjective
      – Collectable statistics
      – How to use these statistics?


Intuitive Properties
• Social Gestures (statistics)
      – Recognition: Citations (incoming links)
               – An influential blog post is recognized by many. The more influential the
                 referring posts are, the more influential the referred post becomes.
      – Activity Generation: Volume of discussion (comments)
               – Amount of discussion initiated by a blog post can be measured by the
                 comments it receives. Large number of comments indicates that the blog
                 post affects many such that they care to write comments, hence
                 influential.
      – Novelty: Referring to (outgoing links)
               – Novel ideas exert more influence. Large number of outlinks suggests that
                 the blog post refers to several other blog posts, hence less novel.
      – Eloquence: “goodness” of a blog post (length)
               – An influential post is often eloquent. Given the informal nature of
                 Blogosphere, there is no incentive for a blogger to write a lengthy piece
                 that bores the readers. Hence, a long post often suggests some necessity
                 of doing so.
• Influence Score = f(Social Gestures)
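
One possible shape for f is a weighted combination of the four gestures, sketched below. The linear form, the weights, and the function name `influence_score` are all hypothetical; in particular, this simplification ignores how influential the referring posts themselves are.

```python
def influence_score(inlinks, comments, outlinks, length,
                    w_in=1.0, w_comments=1.0, w_out=1.0, w_length=0.001):
    """Combine the four social gestures into one score (hypothetical weights).

    Inlinks (recognition) and comments (activity generation) raise the score,
    outlinks (less novelty) lower it, and post length (eloquence) scales it.
    """
    gestures = w_in * inlinks + w_comments * comments - w_out * outlinks
    return (w_length * length) * gestures

# Hypothetical statistics for two blog posts.
posts = {
    "post_a": influence_score(inlinks=10, comments=25, outlinks=2, length=800),
    "post_b": influence_score(inlinks=3, comments=4, outlinks=9, length=1200),
}
ranking = sorted(posts, key=posts.get, reverse=True)
print(ranking)
```

Under this sketch a blogger's influence could be taken from their most influential post, in line with the idea that influential bloggers have influential posts.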


Understanding the Influentials

• Are influential bloggers simply active bloggers?
• If not, in what ways are they different?
      – Can the model differentiate them?
• Are there different types of influential bloggers?
• What other parameters can we include to evolve
  the model?
• Are there temporal patterns of the influential
  bloggers?



Active & Influential Bloggers




•   Active and Influential Bloggers
•   Inactive but Influential Bloggers
•   Active but Non-influential Bloggers



•   They don’t consider “Inactive and Non-influential Bloggers”, because they
    seldom submit blog posts. Moreover, they do not influence others.


Conclusions…


The Blogosphere is one of the fastest-growing social networking
media. The virtual communities in the Blogosphere are not
constrained by physical proximity and allow anytime, anywhere,
and instant communications.

In this paper the authors discuss current research issues in the
Blogosphere, including modeling, blog clustering, blog mining,
community discovery and factorization, influence and
propagation, trust and reputation, and filtering spam blogs.



Questions





Contenu connexe

Tendances

Indexing presentation 2013 06-04
Indexing presentation 2013 06-04Indexing presentation 2013 06-04
Indexing presentation 2013 06-04
Louise Spiteri
 
UCLA X469.21 - Spring '16 Week 6
UCLA X469.21 - Spring '16 Week 6UCLA X469.21 - Spring '16 Week 6
UCLA X469.21 - Spring '16 Week 6
SocialMediaUCLA
 
Ologie Social Media Presentation
Ologie Social Media PresentationOlogie Social Media Presentation
Ologie Social Media Presentation
Leigh Householder
 

Tendances (17)

Blogosphere by FrancoSH

  • 1. Blogosphere: Research Issues, Tools and Applications Franco Sánchez Huertas (UCSP) EDA – June, 2010 21/06/2010 UCSP -FASH 1
  • 2. Overview • Background: Web 2.0 and Social Networks • Blogosphere: Definition, Types, and Comparison • Blogosphere Research Issues • Tools and APIs • Data Collection • Searching the Influentials: The Top Bloggers • Conclusions
  • 3. Web 2.0 and Social Networks
  • 4. Characteristics of Web 2.0 • Rich Internet Applications • User-generated content • User-enriched content • User-developed widgets • Collaborative environment: Participatory Web, Citizen journalism • Thus, it leverages the power of the Long Tail with user-generated data as the driving force • More of a paradigm shift than a technology shift
  • 5. Technology Overview of Web 2.0 • Cascading Style Sheets to aid in the separation of presentation and content • Folksonomies (collaborative tagging, social classification, social indexing, and social tagging) • REST and/or XML- and/or JSON-based APIs • Rich Internet application techniques, often Ajax and/or Flex, Flash-based • Semantically valid XHTML and HTML markup • Syndication, aggregation and notification of data in RSS or Atom feeds • Mashups, merging content from different sources, client- and server-side • Weblog-publishing tools • Wiki or forum software to support user-generated content
  • 6. Some Web 2.0 Services • Blogs – Blogspot – Wordpress – Lamula (Perú) • Wikis – Wikipedia – Wikiversity • Social Networking Sites – Facebook – Twitter – MySpace – Orkut • Digital media sharing websites – Youtube – Flickr – Vimeo – Twitpic • Social Tagging – Del.icio.us
  • 9. Social Networks • A social structure made of nodes (individuals or organizations) that are related to each other by various interdependencies like friendship, kinship, like, ... • Graphical representation – Nodes = members – Edges = relationships • Various realizations – Social bookmarking (Del.icio.us) – Friendship networks (Facebook, MySpace) – Blogosphere – Media Sharing (Flickr, Youtube) – Folksonomies
  • 10. BLOGOSPHERE Definitions, Types, and Comparison
  • 11. Blogging Phenomenon • It’s growing fast as a new means for online communications and interactions • A blogger could gain instant fame via his blogs • A blogger may make a good living with her blogs • Abundant, lucrative business opportunities • A new political arena
  • 12. Blog Structure: Blog Site
  • 13. Blog Structure: Blog Post
  • 14. Blog Structure: Blogger
  • 15. Types of Blogs • Individual vs. community – Single authored (Individual blog sites) – Multi authored (Community blog sites) • Individual Blog Sites – Owned and maintained by individual users. – More like personal accounts, journals or diaries. – No or almost negligible group interaction. – No or almost negligible collective wisdom. • Community Blog Sites – Owned and maintained by a group of like-minded users. – More like discussion forums and discussion boards. – High degree of group discussion and collaboration. – Enormous collective wisdom and open source intelligence. • Regulated vs. anonymous
  • 16. Blogosphere • Complex Social Networks • Vertices (Nodes): Bloggers/ Blog posts/Blog sites • Edges: Relationships/Links • In-Degree: Number of inlinks • Out-Degree: Number of outlinks
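The in-degree and out-degree definitions above can be illustrated with a minimal sketch. It assumes the blog graph is available as a list of directed (source, target) hyperlinks; the function name `degree_counts` and the sample blog names are hypothetical and not part of the original slides.

```python
from collections import defaultdict


def degree_counts(edges):
    """Return (in_degree, out_degree) dicts for directed (source, target) links."""
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for source, target in edges:
        out_degree[source] += 1  # the source gains one outlink
        in_degree[target] += 1   # the target gains one inlink
        # Make sure every node appears in both dicts, even with a zero count.
        in_degree[source] += 0
        out_degree[target] += 0
    return dict(in_degree), dict(out_degree)


# Hypothetical example: three blog sites linking to each other.
links = [("blogA", "blogB"), ("blogC", "blogB"), ("blogB", "blogA")]
in_deg, out_deg = degree_counts(links)
```

Nodes here could equally be bloggers or individual blog posts; only the meaning of an edge changes.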
  • 17. Friendship Networks vs. Blogosphere (each row: Friendship Networks | Blogosphere) – Explicit Links/Edges | Implicit Links/Edges – Undirected Graph | Directed Graph – Network Centrality Measures | Blog Statistics – Quantifying Spread of Influence | Quantifying Influential Members – Nodes are members/actors | Nodes can be bloggers/blogs or blog sites – Strictly defined graph structure | Loosely defined graph structure – “Being in touch” or “Making Friends” | Sharing ideas and opinions – Person-to-person | Person-to-group – Friendship Oriented | Community Oriented – Member’s Reputation/Trust based on network connections and/or location in the network | Member’s Reputation/Trust based on the response to other member’s knowledge solicitations
  • 18. Friendship Networks vs. Blogosphere • Social Networks comprise Social Friendship Networks and the Blogosphere – Social Friendship Networks: Orkut, Facebook, LinkedIn, Classmates.com, etc. – Both: LiveJournal, MySpace, etc. – Blogosphere: TUAW, Blogger, Windows Live Spaces, etc.
  • 20. Understanding Blogosphere • Blogosphere • Blog sites • Bloggers • Blog posts • Reverse chronologically ordered entries • Blogroll • Permalinks • Everyone can publish, but few are heard • Many interesting questions to address – How to build traffic – How to find niche online – How to increase influence – How to … • Fertile research domain
  • 21. Understanding Blogosphere • Understand structures and properties of Blogosphere • Gain insights into the relationships between bloggers, readers, blog posts, comments, different blog sites in Blogosphere • Models help generate artificial data, tune the parameters to simulate special scenarios, and compare various studies and different algorithms • Study peculiarities in Blogosphere and infer latent patterns and structures that could explain certain phenomena like influence, diffusion, splogs, community discovery.
  • 22. Modeling Web and Blogosphere • Some key differences between Web and Blogosphere – Models developed for the Web assume a dense graph structure due to a large number of interconnecting hyperlinks within webpages. This assumption does not hold for Blogosphere, which has been shown to have a very sparse hyperlink structure [Kritikopoulos et al. 2006]. – The level of interaction in terms of comments and replies to blog posts makes Blogosphere different from the Web – The highly dynamic and “short-lived” nature of blog posts cannot be simulated by the Web models, which do not consider dynamicity in web pages – Web models assume webpages accumulate links over time. However, this is not true of Blogosphere – “Categories” and “tags” give blogs flexibility that conventional websites typically don’t have – Blogs use descriptive filenames in permalinks, unlike typical webpage filenames
  • 23. Modeling Blogosphere • Preferential attachment: $P(e: v_i \to v_j) \propto \deg(v_i)$ – Probability of a new edge to a node to be added depends on its degree – “The rich get richer” – Power law distribution or scale free distribution
  • 25. Modeling Blogosphere • Preferential attachment: $P(e: v_i \to v_j) \propto \deg(v_i) / |V|$ – Probability of a new edge to a node to be added depends on its degree – “The rich get richer” – Power law distribution or scale free distribution • Hybrid model: $P(e: v_i \to v_j) \propto \alpha \deg(v_i) / |V| + (1 - \alpha)\, P_{\mathrm{random}}$ – Mixture of both preferential attachment model and random model – Give a lucky poor guy some chance to get rich – To address irreducibility (strong connectedness despite a few isolated subgraphs), the random walk on a graph model proposes a random jump with a fixed probability • Leskovec et al. 2007 studied temporal patterns – How often people create blog posts – Burstiness and popularity – How these posts are linked and what is the link density – Developed an SIS-based model • Kumar et al. 2003 use blogrolls on the blog posts to construct a network of blog posts assuming that blogrolls contain similar blog posts
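The hybrid model can be sketched as a small simulation: each new node links to an existing node chosen preferentially (in proportion to degree) with probability `alpha`, and uniformly at random otherwise. This is an illustrative sketch only. The function name `grow_network`, the parameter `alpha`, the seed edge, and the normalization by total degree (rather than by the number of vertices shown on the slide) are assumptions, not taken from the slides.

```python
import random


def grow_network(num_nodes, alpha, seed=0):
    """Grow a directed graph where each new node adds one outgoing link.

    With probability alpha the target is picked in proportion to its degree
    ("the rich get richer"); otherwise it is picked uniformly at random
    ("give a lucky poor guy some chance").
    """
    rng = random.Random(seed)
    edges = [(1, 0)]    # assumed starting point: node 1 links to node 0
    endpoints = [1, 0]  # each node appears once per edge it touches
    for new_node in range(2, num_nodes):
        if rng.random() < alpha:
            # Sampling from the endpoint list is degree-proportional sampling.
            target = rng.choice(endpoints)
        else:
            # Random model: every existing node is equally likely.
            target = rng.randrange(new_node)
        edges.append((new_node, target))
        endpoints.extend([new_node, target])
    return edges


edges = grow_network(num_nodes=50, alpha=0.8)
```

Setting `alpha=1.0` gives pure preferential attachment, and `alpha=0.0` gives the purely random model.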
  • 26. Blog Clustering
  • 27. Blog Clustering • Dynamic and automatic organization of the content • Convenient accessibility • Optimizing search engines by reducing search space – Search only the relevant cluster • Focused crawling • Summarization • Topic identification • Reduce information overload – 175,000 blog posts per day, i.e., 2 blog posts per second – Dec 2006 • Extraction and analysis of the trends
  • 28. Blog Clustering • tf-idf weighting: $\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$, with $\mathrm{tf}_{i,j} = n_{i,j} / \sum_k n_{k,j}$ and $\mathrm{idf}_i = \log\left( |D| / |\{ d_j : t_i \in d_j \}| \right)$ • Brooks and Montanez 2006 used tf-idf and picked top 3 keywords for blog posts – Clustered blogs based on these keywords – Reported improved clustering as compared to that using tags • Li et al. 2007 assigned different weights to title, body, and comments of blog posts – Need to address high dimensionality and sparsity due to their keyword-based approach • Agarwal et al. 2008 proposed a collective-wisdom based approach – Generate a category relation graph based on user assignments – Compute similarity matrix from this graph
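A minimal sketch of the keyword-selection step described above: score each term of a post by tf-idf and keep the top 3. Here the term frequency is the count of the term in the post divided by the post's total term count, and the inverse document frequency is the log of the number of posts divided by the number of posts containing the term. The whitespace tokenizer, the function name `top_keywords`, and the sample posts are simplifying assumptions, not the procedure used by Brooks and Montanez.

```python
import math
from collections import Counter


def top_keywords(posts, k=3):
    """Return the k highest tf-idf terms for each post (ties broken alphabetically)."""
    tokenized = [post.lower().split() for post in posts]
    num_docs = len(tokenized)

    # Document frequency: in how many posts does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    keywords = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores = {
            term: (n / total) * math.log(num_docs / doc_freq[term])
            for term, n in counts.items()
        }
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords.append(ranked[:k])
    return keywords


# Hypothetical posts; "banana" occurs everywhere, so its idf (and score) is 0.
posts = ["apple apple banana", "banana cherry", "banana date date date"]
keywords = top_keywords(posts)
```

The resulting keyword lists could then feed any standard clustering routine; that step is outside this sketch.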
  • 29. Blog Mining • Interactions between producers and consumers improved with blogs • Consumers not only speak their mind but also broadcast their opinions • Blogs are invaluable information sources – consumers’ beliefs and opinions, – initial reaction to a launch, – understand consumer language, – track trends and buzzwords, and – fine-tune information needs • Blog conversations leave behind trails of links, useful for understanding how information flows and how opinions are shaped and influenced • Tracking blogs also helps in gaining deeper insights
  • 30. Blog Influence • Two types of influence – Influential blog sites and site networks [Gill 2004, Gruhl et al 2004, Java et al 2006] – Influential bloggers in a community [Agarwal et al. 2008] • Blogosphere vs. Friendship Networks – Implicit vs. Explicit links – Blog statistics vs. Centrality measures – “influencing” vs. “could influence” – Loosely vs. Strictly defined graph structures • Blog vs. Webpage Ranking – Blog sites too sparse for webpage ranking algorithms to work [Kritikopoulos et al 2006] – Webpage acquires authority over time, blog posts’ influence diminishes – Greedy approach works better than PageRank, HITS to maximize influence flow [Kempe et al 2003, Richardson & Domingos 2002]
  • 31. Issue of Trust • Open standards and low barriers to publishing have created an overwhelming amount of collective wisdom • Yet it is more difficult for readers to discern whom to trust in some cases • Similar to WWW – Authoritative webpages, e.g., HITS [Kleinberg et al. 1998], PageRank [Page et al. 1999] • Blogosphere allows the masses to create and edit content, compromising the sanctity of the original content • Some work exists for the social friendship network domain, but not many researchers have explored Blogosphere • Huge potential for trust study in Blogosphere domain
  • 32. Trust • Kale et al. 2007 transformed the problem of trust in blogosphere to the one in social friendship networks – Studied propagation of trust among different blog sites – Mined sentiments from a window of words around hyperlinks – Identified positive, negative, or neutral sentiments towards the linked blog site – Constructed a network of blog sites using hyperlinks – Used Gruhl et al. 2004 trust propagation algorithm: $M_{i+1} = M_i \cdot C_i$, performed till convergence, where $M$ is the belief matrix and $C_i$ is the atomic propagation matrix, $C_i = M + M^T M + M^T + M M^T$ – Some concerns • These blog sites have to be linked for trust propagation • Trust is computed between blog sites based on how much one blog agrees or disagrees with the other
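The propagation step on this slide can be sketched with plain nested lists. This is an illustrative reading only: it assumes the atomic propagation matrix is built once from the initial belief matrix, and it runs a fixed number of steps instead of iterating to convergence, because the unnormalized product generally keeps growing. The helper names and the `steps` parameter are hypothetical.

```python
def transpose(matrix):
    """Return the transpose of a square matrix given as nested lists."""
    return [list(row) for row in zip(*matrix)]


def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def add(*matrices):
    """Element-wise sum of several equally sized matrices."""
    size = len(matrices[0])
    return [
        [sum(m[i][j] for m in matrices) for j in range(size)]
        for i in range(size)
    ]


def propagate_trust(belief, steps=3):
    """Apply M <- M * C for a fixed number of steps, with C = M + M^T M + M^T + M M^T."""
    belief_t = transpose(belief)
    atomic = add(belief, matmul(belief_t, belief), belief_t, matmul(belief, belief_t))
    current = belief
    for _ in range(steps):
        current = matmul(current, atomic)
    return current


# Hypothetical two-site example: site 0 expresses trust in site 1.
one_step = propagate_trust([[0, 1], [0, 0]], steps=1)
```

A practical implementation would likely weight the four terms and normalize or threshold the result after each step; those choices are not specified on the slide.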
  • 33. Community Extraction • Blogosphere doesn’t have an explicit notion of communities • Different from blog clustering • Researchers identify communities based on – Links: network of hyperlinks allows identification of virtual communities • Several studies on finding community of webpages like Kleinberg 1998 and Kumar et al. 1999 • While Kleinberg used authority and hubs idea to explore communities of webpages, Kumar et al. extended the idea of hubs and authorities and included co-citations as a way to extract all communities on the web and used graph theoretic algorithms to identify all instances of graph structures that reflect community characteristics. – Content: blogs with similar content or inspired by the same event form a virtual community • Kumar et al. 2003, Efimova and Hendrick 2005, Blanchard 2004
  • 34. Community Extraction • Chin and Chignell 2006 proposed a model for finding communities taking the blogging behavior of bloggers into account – They aligned behavioral approaches through blog reader survey in studying blog community. • Blanchard and Marcus 2004 studied a multiple sport newsgroup “Virtual Settlement” and analyzed the possibility of emerging virtual communities – Newsgroups and discussion forums are similar in terms of interaction patterns to Blogosphere – More person-to-group interaction rather than person-to-person interaction
  • 35. Spam blog (Splogs) Filtering • One of the major rising concerns on Blogosphere • Spammers make most of their money by getting viewers to click on ads that run adjacent to their nonsensical text • Open standards and low barriers to publishing escalate the problem and make it harder to solve • Besides degrading search quality, splogs consume network resources
  • 36. Spam blog (Splogs) Filtering • One of the major rising concerns on Blogosphere • Open standards and low barriers to publishing escalate the problem and make it harder to solve • Besides degrading search quality, splogs consume network resources • Initial research applied web spam link detection approaches – Ntoulas et al. 2006 distinguish between normal webpages and spam webpages based on statistical properties like • number of words, average length of words, anchor text, title keyword frequency, tokenized URL – Gyongyi et al. 2004, Gyongyi et al. 2006 use PageRank to compute the spam score of a webpage • Kolari et al. 2006 consider each blog post as a static webpage and use both content and hyperlinks to classify a blog post as spam using an SVM-based classifier
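Two of the statistical properties listed above (number of words and average word length) can be computed with a few lines of code. This is only a sketch of the feature-extraction idea; the function name `content_features`, the whitespace tokenization, and the sample text are assumptions, and the classifier that would consume these features is not shown.

```python
def content_features(text):
    """Compute simple content statistics that a spam classifier could use as features."""
    words = text.split()
    num_words = len(words)
    # Guard against empty posts to avoid dividing by zero.
    avg_word_length = sum(len(word) for word in words) / num_words if num_words else 0.0
    return {"num_words": num_words, "avg_word_length": avg_word_length}


# Hypothetical post text.
features = content_features("buy cheap pills")
```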
  • 37. Tools and APIs Working in the Blogosphere…
  • 38. Analysis and Visualization Tools • Tools – Data Analysis & Visualization tools – Statistics like centrality measures • NetLogo (http://ccl.northwestern.edu/netlogo/) – Multi-agent programming language and modeling environment designed in Logo – Modelers can give instructions to hundreds or thousands of concurrently operating autonomous agents. – Exploring the connection between the individuals (micro-level) and the patterns that emerge from the interaction of many individuals (macro-level).
  • 39. Analysis and Visualization Tools • UCINet (http://www.analytictech.com/) – Package for the analysis of social network data including centrality measures, subgroup identification, role analysis, elementary graph theory, and permutation-based statistical analysis – Has strong matrix analysis routines, such as matrix algebra and multivariate statistics • Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek/) – Slovenian for spider – Analyzing and visualizing large networks like social networks • Network package in R (http://cran.r-project.org/src/contrib/Descriptions/network.htm) – The network class can represent a range of relational data types, and supports arbitrary vertex/edge/graph attributes – Used to create and/or modify network objects for social network analysis (SNA)
  • 40. Analysis and Visualization Tools • InFlow (http://www.orgnet.com/inflow3.html) – Integrated product for network analysis and visualization – Used in the SNA domain • NetMiner (http://www.netminer.com/) – Tool for exploratory network data analysis and visualization – NetMiner allows users to explore network data visually and interactively, and helps in detecting underlying patterns and structures of the network
  • 41. APIs • APIs – Data collection (blog posts, inlinks, tags, etc.) – Technorati – Digg – del.icio.us – Facebook – StumbleUpon
  • 42. Technorati API • bloginfo query API url: http://api.technorati.com/bloginfo?key=[apikey]&url=[blog url] Sample response: <result> <url>[URL]</url> <weblog> <name>[blog name]</name> <url>[blog URL]</url> <rssurl>[blog RSS URL]</rssurl> <atomurl>[blog Atom URL]</atomurl> <inboundblogs>[inbound blogs]</inboundblogs> <inboundlinks>[inbound links]</inboundlinks> <lastupdate>[date blog last updated]</lastupdate> <rank>[blog ranking]</rank> <lang></lang> <foafurl>[blog foaf URL]</foafurl> </weblog> </result>
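A response shaped like the bloginfo sample above can be parsed with Python's standard XML library. The sketch below works on a hard-coded string rather than calling the service, since the endpoint may no longer be available. The values in `SAMPLE_RESPONSE` and the function name `parse_bloginfo` are made up for illustration; only the element names come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical values filled into the response structure shown on the slide.
SAMPLE_RESPONSE = """
<result>
  <url>http://example.org/blog</url>
  <weblog>
    <name>Example Blog</name>
    <url>http://example.org/blog</url>
    <inboundblogs>7</inboundblogs>
    <inboundlinks>42</inboundlinks>
    <rank>1234</rank>
  </weblog>
</result>
"""


def parse_bloginfo(xml_text):
    """Return the child elements of <weblog> as a dict of tag -> text."""
    root = ET.fromstring(xml_text)
    weblog = root.find("weblog")
    return {child.tag: (child.text or "") for child in weblog}


info = parse_bloginfo(SAMPLE_RESPONSE)
```

Fields such as `inboundlinks` and `rank` are the kind of blog statistics the later slides use as influence signals.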
  • 43. Technorati API • BlogPostTags query API url: http://api.technorati.com/blogposttags?key=[apikey]&url=[blog url] Sample response: <document> <result> <querycount>[limit parameter]</querycount> </result> <item> <tag>[tag name]</tag> <posts>[tag count]</posts> </item> </document>
  • 44. del.icio.us API https://api.del.icio.us/v1/tags/get Returns a list of tags and number of times used Sample response <tags> <tag count="1" tag="activedesktop" /> <tag count="1" tag="business" /> <tag count="3" tag="radio" /> <tag count="5" tag="xml" /> <tag count="1" tag="xp" /> <tag count="1" tag="xpi" /> </tags>
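The tag list in the sample response above maps naturally to a dictionary of tag name to usage count. The following sketch parses the sample text from the slide with the standard library and makes no network request; the function name `parse_tags` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Sample response copied from the slide.
SAMPLE_TAGS = """
<tags>
  <tag count="1" tag="activedesktop" />
  <tag count="1" tag="business" />
  <tag count="3" tag="radio" />
  <tag count="5" tag="xml" />
  <tag count="1" tag="xp" />
  <tag count="1" tag="xpi" />
</tags>
"""


def parse_tags(xml_text):
    """Return a dict mapping each tag name to the number of times it was used."""
    root = ET.fromstring(xml_text)
    return {element.get("tag"): int(element.get("count")) for element in root.findall("tag")}


tag_counts = parse_tags(SAMPLE_TAGS)
most_used = max(tag_counts, key=tag_counts.get)
```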
  • 45. Data Collection Using the Blogosphere…
  • 46. Available Datasets • TREC (http://ir.dcs.gla.ac.uk/test_collections/blog06info.html) – A crawl of Feeds, and associated Permalink and homepage documents (from late 2005 and early 2006) – 100,649 feeds were polled once a week for 11 weeks – Total Number of Feeds collected: 753,681 – Average feeds collected every day: 10,615 – Uncompressed Size: 38.6GB; Compressed Size: 8.0GB – Reasonably sized spam component for added realism – Fee: £400 ~ $794.36
  • 47. Available Datasets • Mobile Network (http://kdl.cs.umass.edu/data/msn/msn-info.html) – 27 objects – over 180,000 links – 1 object attribute – 2 link attributes • Other ways – Crawl blogs – Blogcatalog – Statistics available from technorati API – Tagging available from del.icio.us API
  • 48. Data Crawler • BlogTrackers – User interface to crawl blog sites • Scratch crawling (from blog archives) • Incremental crawling (from RSS feeds) – Stores the blog posts in Microsoft SQL server – Collects: blog post title, blog post content, blog post tags, blog post permalink, outlinks, inlinks, blogger name, blog post date and time – Tracks blog posts, e.g., generates tag clouds for a user-specified time window
  • 49. Collectable Statistics from Blogs • Inbound links – Blogs, blog post, webpage • Outbound links – Blogs, blog post, webpage • Comments • Blog server logs • Subscribers • Time to read/length • Links to post and incoming traffic from them • Links from post and outgoing traffic to them • Topic frequency score • Blogroll links • Tagged URLs (del.icio.us, furl)
  • 50. Searching The Influentials: The Top Bloggers • Active bloggers – Easy to define – Often listed at a blog site – Are they necessarily influential? • How to define an influential blogger? – Influential bloggers have influential posts – Subjective – Collectable statistics – How to use these statistics
  • 51. Intuitive Properties • Social Gestures (statistics) – Recognition: Citations (incoming links) – An influential blog post is recognized by many. The more influential the referring posts are, the more influential the referred post becomes. – Activity Generation: Volume of discussion (comments) – The amount of discussion initiated by a blog post can be measured by the comments it receives. A large number of comments indicates that the blog post affects many readers enough that they care to write comments, hence it is influential. – Novelty: Referring to (outgoing links) – Novel ideas exert more influence. A large number of outlinks suggests that the blog post refers to several other blog posts, hence it is less novel. – Eloquence: “goodness” of a blog post (length) – An influential blog post is often eloquent. Given the informal nature of Blogosphere, there is no incentive for a blogger to write a lengthy piece that bores the readers. Hence, a long post often suggests some necessity of doing so. • Influence Score = f(Social Gestures)
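One way to make "Influence Score = f(Social Gestures)" concrete is a weighted combination of the four statistics. The sketch below is a hypothetical choice of f, not the scoring function from the cited work: inlinks and comments raise the score, outlinks lower it, and post length scales it. The weights, the `Post` fields, and the function name are assumptions, and the sketch ignores the recursive idea that inlinks from influential posts should count for more.

```python
from dataclasses import dataclass


@dataclass
class Post:
    inlinks: int   # recognition
    comments: int  # activity generation
    outlinks: int  # many outlinks suggest less novelty
    length: int    # eloquence, measured in words


def influence_score(post, w_in=1.0, w_com=1.0, w_out=1.0, w_len=0.001):
    """Hypothetical f(Social Gestures): length-scaled sum of weighted gestures."""
    link_flow = w_in * post.inlinks - w_out * post.outlinks
    length_factor = 1.0 + w_len * post.length
    return length_factor * (w_com * post.comments + link_flow)


# Hypothetical posts: one widely cited and discussed, one mostly linking out.
cited_post = Post(inlinks=10, comments=5, outlinks=2, length=1000)
linking_post = Post(inlinks=1, comments=0, outlinks=5, length=100)
scores = [influence_score(cited_post), influence_score(linking_post)]
```

Ranking bloggers by the score of their best posts follows the slide's point that influential bloggers have influential posts.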
  • 52. Understanding the Influentials • Are influential bloggers simply active bloggers? • If not, in what ways are they different? – Can the model differentiate them? • Are there different types of influential bloggers? • What other parameters can we include to evolve the model? • Are there temporal patterns of the influential bloggers?
  • 53. Active & Influential Bloggers • Active and Influential Bloggers • Inactive but Influential Bloggers • Active but Non-influential Bloggers • They don’t consider “Inactive and Non-influential Bloggers”, because they seldom submit blog posts. Moreover, they do not influence others.
  • 54. Conclusions… Blogosphere is one of the fastest-growing social networking media. The virtual communities in the blogosphere are not constrained by physical proximity and allow anytime, anywhere, and instant communications. In this paper the authors discuss current research issues in Blogosphere including modeling, blog clustering, blog mining, community discovery and factorization, influence and propagation, trust and reputation, and filtering spam blogs.
  • 55. Questions