Webometrics Revisited in Big Data Age_DISC2013
1. Virtual Knowledge Studio (VKS)
“Webometrics Studies” Revisited
in the Age of “Big Data”
Asso. Prof. Dr. Han Woo PARK
CyberEmotions Research Institute
Dept. of Media & Communication
YeungNam University
214-1 Dae-dong, Gyeongsan-si,
Gyeongsangbuk-do 712-749
Republic of Korea
www.hanpark.net
cerc.yu.ac.kr
eastasia.yu.ac.kr
asia-triplehelix.org
2. Big data
The term “big data” refers to “analytical technologies that
have existed for years but can now be applied faster, on a
greater scale and are accessible to more users” (Miller,
2013).
Big data sizes may vary per discipline.
Characteristics: Gartner’s 3Vs plus SAS’s VC and IBM’s
Veracity
- Volume (amount of data), Velocity (speed of data in and
out), Variety (range of data types and sources)
- Variability: Data flows can be highly inconsistent with daily,
seasonal, and event-triggered peak data loads
- Complexity: Multiple data sources requiring cleaning,
linking, and matching the data across systems
- Veracity: 1 in 3 business leaders don’t trust the information
they use to make decisions.
http://en.wikipedia.org/wiki/Big_data
http://www-01.ibm.com/software/data/bigdata/
5. Data-driven research focuses on
extracting meaningful data from techno-socio-economic systems to discover
hidden patterns.
Today’s “big” is probably tomorrow’s “medium” and
next week’s “small”; thus the most effective
definition of “big data” may be derived when the size of
the data itself becomes part of the research problem.
6. Introduction
Webometrics is broadly defined as the study of
web-based content (e.g., text, images, audio-visual
objects, and hyperlinks) with primarily quantitative
indicators for social science research goals and
visualization techniques derived from information
science and social network analysis.
7. • Han Woo Park
- “hidden” and “relational” data about
lots of people as well as the few
individuals, or small groups
• Lev Manovich
- “surface” data about lots of people (i.e.,
statistical, mathematical or computational
techniques for analyzing data)
- “deep” data about the few individuals or small
groups (i.e., hermeneutics, participant
observation, thick description, semiotics, and
close reading)
8. First type of Webometrics
• Hyperlink Network Analysis
- Inter-linkage: who-linked-to-whom matrix
- Co-inlink: a link to two different nodes from a third node
- Co-outlink: a link from two different nodes to a third node
Björneborn (2003)
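The three link types above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the cited studies: the `links` dictionary and the node names are hypothetical, and self-links are simply ignored.

```python
from itertools import combinations

# Hypothetical "who linked to whom" data: links[source] = set of targets.
links = {
    "a": {"b", "c"},
    "b": {"c"},
    "c": set(),
    "d": {"b", "c"},
}

def co_inlinks(links):
    """For each pair of nodes, count the third nodes that link to both."""
    counts = {}
    for source, targets in links.items():
        # Drop a self-link so the source is always a genuine third node.
        for pair in combinations(sorted(targets - {source}), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts

def co_outlinks(links):
    """For each pair of nodes, count the third nodes that both link to."""
    counts = {}
    for u, v in combinations(sorted(links), 2):
        shared = (links[u] & links[v]) - {u, v}
        if shared:
            counts[(u, v)] = len(shared)
    return counts

print(co_inlinks(links))
print(co_outlinks(links))
```

Here the `links` dictionary itself plays the role of the inter-linkage matrix; the two functions derive the co-link counts from it.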
9. Inter-link network analysis diagram among Korean e-science sites within the public domain
WCU
WEBOMETRICS
INSTITUTE
Mapping the e-science landscape
In South Korea using the Webometrics method
11. Findings
As seen in Figure 4, the network structure shows a clear butterfly pattern. There is one hub (ghism)
that belongs to Park Geun-Hye (Park GH, www.cyworld.com/ghism), the daughter of ex-president
Park Jeong-Hee and one of two major GNP candidates (along with president-elect Lee MB) in the
2007 presidential race.
Figure 4: Cyworld Mini-hompies of Korean legislators
How do social scientists use link data
from search engines to understand
Internet-based political and electoral
communication?
INVESTIGATING INTERNET-BASED POLITICS WITH E-RESEARCH TOOLS
Case 2. Cyworld Mini-hompies of Korean Legislators
12. Sociology of Hyperlink Networks of Web 1.0,
Web 2.0, and Twitter
A Case Study of South Korea
13. Introduction
‣ Online & offline lives ➭ co-constructing (e.g. Beer & Burrows, 2007)
‣ Politicians communicate with their constituencies using different platforms
‣ Questions:
- What are the structural similarities and/or differences in South Korean
politicians’ networks from Web 1.0 to Web 2.0 (and Twitter)?
- Are online structures similar to structures in the physical world?
- Are online patterns affected by offline relationships?
‣ Related studies conducted:
- online social network analysis
- online networks in Web 2.0
- role of Twitter on online politics
14. 2001
2000
‣ 59 isolated in 2000
‣ more centralised in 2001
‣ network of 2001 ➭ a ‘star’ network
- might have been affected by political events
➭ presidential election in 2001
Web 1.0
15. 2005
2006
‣ hubs disappearing
‣ easy use of blogs
‣ clear boundaries between different parties
‣ strong presence of GNP Assembly members
➭ party policy on using blogs
Web 2.0
18. Bi-linked network of politically active
A-list Korean citizen blogs (July 2005)
URI=Centre
DLP=Left
GNP=Right
Just A-list blogs exchanging links with politicians
19. Affiliation network diagram using pages
linked to Lee’s and Park’s sites
N = 901 (Lee: 215, Park: 692, Shared: 6)
23. “Those studies perpetuate the idea that linking behaviour
is not random, and that links are ‘socially significant in
some way’. In this perspective, links have an
‘information side-effect’, they can be used to
understand other facts even though they were not
individually designed to do so: ‘information side-effects
are by-products of data intended for one use which
can be mined in order to understand some tangential,
and possibly larger scale, phenomena’”
24. Park and his colleagues were
extensively cited: 9 times!
• Barnett GA, Chung CJ and Park HW (2011) Uncovering transnational hyperlink patterns
and web mediated contents: a new approach based on cracking .com domain. Social
Science Computer Review 29(3): 369–384.
• Hsu C and Park HW (2011) Sociology of hyperlink networks of Web 1.0, Web 2.0, and
Twitter: a case study of South Korea. Social Science Computer Review 29(3): 354–368.
• Park HW (2003) Hyperlink network analysis: a new method for the study of social structure
on the web. Connections 25(1): 49–61.
• Park HW (2010) Mapping the e-science landscape in South Korea using the webometrics
method. Journal of Computer-Mediated Communication 15(2): 211–229.
• Park HW and Jankowski NW (2008) A hyperlink network analysis of citizen blogs in South
Korean politics. Javnost: The Public 15(2): 5–16.
• Park HW and Thelwall M (2003) Hyperlink analyses of the World Wide Web: a review.
Journal of Computer-Mediated Communication 8(4).
• Park HW and Thelwall M (2008) Developing network indicators for ideological landscapes
from the political blogosphere in South Korea. Journal of Computer-Mediated
Communication 13(4): 856–879.
• Park HW, Kim C and Barnett GA (2004) Socio-communicational structure among political
actors on the web in South Korea. New Media & Society 6(3): 403–423.
• Park HW, Thelwall M and Kluver R (2005) Political hyperlinking in South Korea: technical
indicators of ideology and content. Sociological Research Online 12(3).
25. A comment from those who are
NOT doing a hyperlink analysis
• In a chapter of The Sage Handbook of
Online Research Methods edited by
Fielding et al. (2008), Horgan emphasizes
that ‘link analysis’ has become an active
research domain in examining social
behavior online.
26. A threat to Webometrics
• The key application in this area is to collect
some incoming, outgoing, inter-linking, and
co-linking data from search engines
- AltaVista in early 2000
- Yahoo revived AltaVista’s hyperlink
commands via “Site Explorer” and its API
- Yahoo discontinued its API option for
interlinkage data in April 2011, and finally
stopped its popular Site Explorer service in
November 2011
28. A new proposal
• Mike Thelwall
- URL citation searches with the Bing search
API facilities
• Liwen Vaughan
- Incoming hyperlinks from Alexa.com
Can these "alternative" techniques be
acceptable for scientific publishing?
29. A new proposal : SEO Tools
• Search Engine Optimization Tools
http://www.majesticseo.com/
http://www.opensiteexplorer.org/
https://ahrefs.com/
Enrique Orduña-Malea & John J.
Regazzi (2013). Influence of the academic
library on U.S. university reputation:
a webometric approach. Technologies, 1, 26-43. http://www.mdpi.com/2227-7080/1/2/26
30. Webometrics Ranking of
World Universities
The link visibility data are collected from the two most
important providers of this information: Majestic
SEO and ahrefs.
Both use their own crawlers, generating different
databases that should be used jointly to fill
gaps or correct mistakes.
The indicator is the product of the square root of the
number of backlinks and the number of
domains originating those backlinks, so link
popularity matters, but link diversity matters
even more.
The maximum of the normalized results is the impact
indicator.
http://www.webometrics.info/en/Methodology
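A minimal sketch of how such an indicator could be computed. Several points are assumptions rather than the ranking's documented procedure: the wording is read as sqrt(backlinks) × domains, each provider's scores are normalized by dividing by that provider's top score, and the provider names, site names, and counts below are made up for illustration.

```python
import math

def link_score(backlinks, domains):
    """Square root of the backlink count times the number of linking domains."""
    return math.sqrt(backlinks) * domains

def impact_indicator(per_provider):
    """per_provider: {provider: {site: (backlinks, domains)}} -> {site: indicator}."""
    result = {}
    for sites in per_provider.values():
        scores = {s: link_score(b, d) for s, (b, d) in sites.items()}
        top = max(scores.values(), default=0.0)
        for site, score in scores.items():
            # Assumed normalization: divide by the provider's best score.
            normalized = score / top if top else 0.0
            # Keep the maximum normalized value seen across providers.
            result[site] = max(result.get(site, 0.0), normalized)
    return result

# Hypothetical counts from two hypothetical providers.
data = {
    "provider_a": {"u1": (10000, 50), "u2": (2500, 40), "u3": (100, 10)},
    "provider_b": {"u1": (400, 10), "u2": (900, 10), "u3": (100, 15)},
}
indicator = impact_indicator(data)
print(indicator)
```

Taking the maximum across providers means a site is not penalized when one crawler has poor coverage of it.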
31. Interlinkage among world universities
• Barnett, G.A., Park, H. W., Jiang, K., Tang, C.,
& Aguillo, I. F. (2013 forthcoming). A Multi-Level Network Analysis of Web-Citations
Among The World’s Universities.
Scientometrics*.
Isidro F. Aguillo
“Large interlinking matrix (1000*1000) are no
longer possible to obtain. Perhaps national
academic systems (200 or 300 institutions)”
32. Intentional inattention
among Information Scientists?
• Robert Ackland (2013). Web Social Science.
- http://voson.anu.edu.au/
• Richard Rogers (2013). Digital Methods.
- https://www.issuecrawler.net/index.php
- https://www.digitalmethods.net/Dmi/ToolDatabase
33. Let us move to Web Visibility Analysis
Frequently occurring key words in e-science webpages in Korea
Created on Many Eyes (http://many-eyes.com)
Words are larger according to the frequency of their occurrence but their
positions are randomly-chosen for the best visualization
34. Websites retrieved more than two times
Note: Websites are larger according to their frequency of retrieval; however, their
colors and locations are randomly chosen for the best visualization
35. 2nd type of Webometrics: Web Visibility
Web visibility as an indicator of online political power
Presence or appearance of actors or issues being
discussed by the public (Internet users) on the web.
Tracking web visibility is a powerful way to gain insight
into public reactions to actors or issues.
Recent studies indicate a positive relationship
between politicians’ web visibility level and election results.
Also, the co-occurrence web visibility between two
politicians represents their hidden online political
relationships based on public perception.
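39. Using e-research tools: web visibility analysis
In the blogosphere, candidates’ web visibility levels were closely
correlated with the number of votes they received (Lim & Park, 2010,
JKDAS).
[Figure: chart of actual votes and average number of blogs for six
candidates (경대수, 정범구, 정원헌, 박기수, 이태희, 김경회); values shown:
29,120; 19,427; 14,218; 3,071; 2,125; 504]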
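41. I. Characteristics and influence of social media
Case of the October 26 by-elections (2)
• Built and analyzed a name network from names mentioned together
on Facebook
• Early on, the two candidates were mentioned about equally;
midway through, Park Won-soon and his supporters were mentioned
while supporters of candidate Na Kyung-won dropped out of view;
by the end, the network had reorganized around Park Won-soon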
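42. I. Comparing centrality in the semantic network
Case of the October 26 by-elections (2)
• Frequency analysis of the words that appear in messages
about the Seoul mayoral election
• Candidate Na Kyung-won’s frequency declined from the start;
apart from appearing late in the campaign in talk about the
contest with candidate Park Won-soon and the election result,
she remained on the periphery of the discourse throughout
• The Ahn Cheol-soo effect was large early on and faded after
the midpoint, but frequent mentions of the Grand National
Party (한나라당) revealed sentiment against the ruling party,
which indicates the character of the election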
43.
As Lim & Park (2011, 2013)
claim, the use of web
mentions of politicians’
names is particularly useful
for hierarchically ranking
individual politicians.
However, it may not
sufficiently capture the
entropy probability of an
event (hidden in changing
communication structures)
resulting from the amount of
information conveyed by the
occurrence of that event
(Shannon, 1948).
44.
Taleb (2012) argues that society
can be conceived as a complex
fabric consisting of the extended
disorder family, including
uncertainty, chance, entropy, etc.
Therefore, such a disordered system
is better derived from
empirical data mining than
obtained from an a priori theorem.
Uncertainty exists when three or
more events take place
simultaneously, and it is
increasingly beyond the control of
individual events
(Leydesdorff, 2008).
45.
In social and communication
sciences, entropy-based
indicators have been widely
used for exploring entropy
values generated from
university-industry-government (UIG)
relationships.
This “Triple Helix Model”
(THM) can be applied to
the co-occurrence of
two or three terms in
a public search engine
database.
46. Mapping Election Campaigns Through Negative Entropy:
Triple and Quadruple Helix Approach
to Korea’s 2012 Presidential Election
Social media platforms have become a notable venue for Korean
voters wishing to share their opinions and predictions with others
(Park et al., 2011; Sams & Park, 2013).
Politicians have made increasing use of SNSs to provide updates
and communicate with citizens (Hsu & Park, 2012).
With the increasing proliferation of smartphones and portable
computers in Korea, SNSs have been widely used for facilitating
political discourse.
Prior studies have found that Web 1.0 contents tended to contain the
more enduring political and electoral statements of the public in
various contexts.
47. Introduction
To better understand the dynamics of the 2012 presidential election
in Korea, this study estimates the web visibility of the three major
candidates— Geun-Hye Park (PARK), Cheol-Soo Ahn (AHN), and
Jae-In Moon (MOON)—in the entire digital sphere.
48. Literature Review
The total probabilistic entropy (uncertainty) produced by changes in one or
two dimensions is always positive, which is in accordance with the second
law of thermodynamics (Theil, 1972, p. 59).
On the other hand, the relative contribution of each event to the
summation in three or four dimensions can be positive, zero, or negative
(configurational information).
This configurational information provides a measure of synergy within a
complex communication system. Network effects occur in a systemic and
nonlinear manner when loops in the configuration generate redundancies
in relationships between three or four events (Leydesdorff, 2008).
49. Method: Data collection
The number of hits for each search query per media
channel (Facebook, Twitter, and Google) was harvested.
The hit counts obtained from Google.com were
employed to look primarily at entropies represented on a
set of digitally accessible documents (e.g., online
versions of newspapers, online word-of-mouth, Web 1.0
contents, etc.).
We measured the occurrence and co-occurrence of the
politicians’ names based on their bilateral, trilateral, and
quadruple relationships by using Boolean operators.
For example, we measured the number of web and
social media mentions referring only to PARK (that is, no
mention of AHN, MOON, or the term “president”).
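The Boolean queries described above could be generated along the following lines. This is a sketch under stated assumptions: the slide does not give the actual query strings, so the `AND` / minus-sign syntax and the placeholder terms are hypothetical, and real queries would presumably use the names as written in the searched sources.

```python
from itertools import combinations

def exclusive_queries(terms):
    """One query per non-empty subset: subset terms ANDed, all other terms excluded."""
    queries = {}
    for size in range(1, len(terms) + 1):
        for subset in combinations(terms, size):
            included = " AND ".join(f'"{t}"' for t in subset)
            excluded = "".join(f' -"{t}"' for t in terms if t not in subset)
            queries[subset] = included + excluded
    return queries

# Placeholder terms standing in for the three candidates and "president".
terms = ["PARK", "AHN", "MOON", "president"]
queries = exclusive_queries(terms)
for subset, query in queries.items():
    print(subset, "->", query)
```

With four terms this yields 15 mutually exclusive queries, covering single, bilateral, trilateral, and quadruple occurrences; the hit count of each query would then be recorded per media channel.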
51. Literature Review
Twitter can be very effective at amplifying messages, particularly through its
one-to-many mode of communication (Barash & Golder, 2010).
Twitter is viable both as a political news and communication channel
(González-Bailón, Borge-Holthoefer, Rivero & Moreno, 2011; Hsu &
Park, 2011, 2012; Otterbacher, Shapiro, & Hemphill, 2013)
and as a platform for citizens seeking political participation and engagement
(Hsu, Park, & Park, 2013; Kim & Park, 2011; Tufekci & Wilson, 2012).
52. Literature Review
The mode of information sharing on Facebook differs from that on Twitter.
Facebook functions as a living room where friends talk to one another.
Facebook can be a mixture of interpersonal and mass channels for the sharing of
informational as well as social messages in a context of political campaign (Bond
et al., 2012; Effing, van Hillegersberg, & Huibers, 2011; Robertson, Vatrapu, &
Medina, 2010; Vitak et al., 2011).
Both Twitter and Facebook communications seem to be biased because the two
platforms have been particularly dominated by the “2040 Generation”, who are
generally categorized as political liberals in Korea (Kwak et al., 2011).
53. Research questions
Therefore, it is important to examine which (social) media
conversations are more likely to generate more entropy than
others, and for which politician:
RQ 1) What (social) media generate (negative) entropy more than
others across different periods?
RQ 2) Which politician (or which pair of politicians) generates
entropy more than others for bilateral, trilateral, or quadruple
relationships across various media and periods?
55.
Entropy values (expressed as T for transmission)
for bilateral relationships are, by
definition, positive. Here T is defined as the
difference in uncertainty when the probability
distributions of two incidents (e.g., i and j) are
combined. The mutual information transmission
capacity, expressed in T values, is measured in
“bits” of information (for a more detailed
mathematical definition, see Leydesdorff, 2003):
\[ H_i = -\sum_i p_i \log_2 p_i ; \qquad H_{ij} = -\sum_i \sum_j p_{ij} \log_2 p_{ij} \]
\[ H_{ij} = H_i + H_j - T_{ij} \]
\[ T_{ij} = H_i + H_j - H_{ij} \tag{1} \]
Here T_{ij} is zero if the two distributions are mutually
independent and positive otherwise (Theil, 1972).
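Equation (1) can be checked numerically with a short sketch. How the slides turn hit counts into a joint distribution is not stated, so the input here is assumed to be a hypothetical two-way table of counts (rows for incident i, columns for incident j).

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def transmission(joint_counts):
    """T_ij = H_i + H_j - H_ij for a 2-D table of counts."""
    total = sum(sum(row) for row in joint_counts)
    p_ij = [[c / total for c in row] for row in joint_counts]
    p_i = [sum(row) for row in p_ij]          # row marginals
    p_j = [sum(col) for col in zip(*p_ij)]    # column marginals
    h_ij = entropy(p for row in p_ij for p in row)
    return entropy(p_i) + entropy(p_j) - h_ij

# Hypothetical tables: independent incidents vs. perfectly coupled incidents.
independent = [[25, 25], [25, 25]]
coupled = [[50, 0], [0, 50]]
print(transmission(independent), transmission(coupled))
```

The independent table gives zero transmission and the coupled table gives one bit, in line with the statement that T_{ij} is zero only for mutually independent distributions.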
56.
On the other hand, T values for trilateral and quadruple
relationships can be negative, positive, or zero depending on the
size of contributing terms. Therefore, it is necessary to compare
the absolute value of each (negative) entropy value when entropy
values are calculated for trilateral and quadruple relationships. In
the case of entropy values for trilateral and quadruple
relationships, the higher the absolute entropy value, the more
balanced the communication system is. Let p denote PARK;
a, AHN; and m, MOON and formulate mutual information in these
three dimensions as follows (Abramson, 1963, p. 129):
\[ T_{pam} = H_p + H_a + H_m - H_{pa} - H_{pm} - H_{am} + H_{pam} \tag{2} \]
Here we are interested not only in information on mutual
relationships between these three candidates but also in semantic
relationships with respect to the term “president.” Accordingly, we
measure the entropy value by using mutual information in these
four dimensions (here “r” denotes “president”):
\[ T_{pamr} = H_p + H_a + H_m + H_r - H_{pa} - H_{pm} - H_{pr} - H_{am} - H_{ar} - H_{mr} + H_{pam} + H_{par} + H_{pmr} + H_{amr} - H_{pamr} \tag{3} \]
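The alternating sums in equations (2) and (3) follow one pattern: add the entropies of odd-sized groups of dimensions and subtract those of even-sized groups. A minimal sketch follows, assuming the joint distribution is available as a hypothetical dictionary from outcome tuples (for example 1/0 for whether each term is mentioned) to counts; the slides do not specify this data layout.

```python
import math
from itertools import combinations

def marginal_entropy(joint, dims):
    """Entropy in bits of the marginal distribution over the given dimensions."""
    total = sum(joint.values())
    marginal = {}
    for outcome, count in joint.items():
        key = tuple(outcome[d] for d in dims)
        marginal[key] = marginal.get(key, 0) + count
    return -sum((c / total) * math.log2(c / total)
                for c in marginal.values() if c > 0)

def mutual_information(joint, n_dims):
    """Alternating sum of marginal entropies: + singles, - pairs, + triples, ..."""
    t = 0.0
    for size in range(1, n_dims + 1):
        sign = 1 if size % 2 == 1 else -1
        for dims in combinations(range(n_dims), size):
            t += sign * marginal_entropy(joint, dims)
    return t

# Hypothetical 3-D examples: third variable is the XOR of the first two,
# versus three variables that always move together.
xor_joint = {(0, 0, 0): 1, (0, 1, 1): 1, (1, 0, 1): 1, (1, 1, 0): 1}
same_joint = {(0, 0, 0): 1, (1, 1, 1): 1}
print(mutual_information(xor_joint, 3), mutual_information(same_joint, 3))
```

The XOR example gives a negative value and the fully coupled example a positive one, illustrating why T in three or four dimensions can take either sign while the two-dimensional case cannot go below zero.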
60. Discussion and conclusions
Twitter scored the most negative entropy
values, followed by Facebook; Google came last.
This indicates that Twitter is the most open
communication system.
The entropy values for the liberal candidates (AHN and
MOON) were higher than those for their conservative
opponent PARK, more so on social media than in the
Google sphere.
This may not be surprising because both Twitter
and Facebook have particularly appealed to
Korean citizens aged from their late teens to
their early 40s.
61. Discussion and conclusions
PARK’s entropy was slightly higher on
Google than that of her liberal challenger MOON.
PARK was successful in garnering strong support
from senior voters in their 50s and 60s, who accounted
for 39% of the population, up from 29% a decade
ago (Wall Street Journal, 2012).
Exit polls also revealed that PARK gained support
from 62% of voters in their 50s and 72% of voters
in their 60s. Indeed, the most significant statistic of
the election was turnout: 65.2%, 72.5%, and 78.7% of
South Koreans in their 20s, 30s, and 40s voted,
respectively, compared with 89.9% of those in their
50s and 78.8% of those over 60.
Editor's notes
It could be more intuitive to see this through graphics. We depicted the politicians’ Twitter networks, drawing the mention network over the following-follower network (explanation, if necessary).
(conclusion)
As shown in Figure 4, Park’s network suggests that users constructed an organized and hierarchical issue network. In addition, productive users who continuously participated in the issue network played a role as hubs in terms of user interactions. Red nodes indicate those users who were consistently engaged in Park’s network. These users had more communication power and were more centrally positioned in the network than other users who temporarily participated in the issue network. In other words, their tweets were more likely to be retweeted and to induce responses from others. In Park’s issue network, 7,103 users generated 8,018 retweets, 122 replies, and 22 mentions.
Notably, as shown in Figure 5, Lee’s issue network was unique in terms of its topology. As shown in Table 2, the high clusterability of the network indicates that users who tweeted about Lee formed an extremely cohesive network and were more connected to one another than those in Park’s and Moon’s networks. In Lee’s network, a total of 6,292 users produced 7,561 retweets, 208 replies, and 48 mentions.
As shown in Figure 6, Moon’s issue network indicates that those users with more communication power were not necessarily productive, which differs from the case of Park’s network. In Moon’s network, a total of 5,328 users generated 5,707 retweets, 78 replies, and 24 mentions. Given that he was the major opposition candidate against Park and that he had strong public support comparable to that for Park, there were substantial differences in the number of users and user interactions between Moon and Park (The Press, 2012).