Mapping (big) data science (15 Dec 2014), for undergraduate and graduate students

Mapping (Big) Data-
Research and Issues
Virtual Knowledge Studio (VKS)
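Prof. Han Woo Park
Department of Media and Communication, Yeungnam University
CyberEmotions Research Center, Yeungnam University
President, Asia Triple Helix Society
Daegu-Gyeongbuk Social Media Forum
TEDxPalgong (former)
Royal Netherlands Academy (former)
Oxford Internet Institute (former)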
hanpark@ynu.ac.kr
www.hanpark.net

[ Contents ]
1. Concept and characteristics of big data
2. Background of data science
3. (Big) data R&D trends
4. Social issues and implications

Big data
• Big data usually includes data sets with
sizes beyond the ability of commonly-used
software tools to capture, manage, and
process the data within a tolerable elapsed
time.
• Big data sizes may vary per discipline.
• Characteristics: Gartner’s 3Vs plus SAS’s VC
- Volume (amount of data), Velocity (speed of
data in and out), Variety (range of data
types and sources)
- Variability: Data flows can be highly
inconsistent with daily, seasonal, and event-
triggered peak data loads
- Complexity: Multiple data sources requiring
cleaning, linking, and matching the data
across systems.
http://en.wikipedia.org/wiki/Big_data

http://ec.europa.eu/enterprise/policies/innovation/policy/busines

http://www.youtube.com/watch?v=G3XoEGHQbrA&list=UUGrraKQiTF-ml0KqPQ8mrUA

Data-driven Research that focuses
on extracting meaningful data from
techno-socio-economic systems to
discover some hidden patterns.

Today’s “big” is probably tomorrow’s “medium” and
next week’s “small” and thus the most effective definition
of “big data” may be derived when the size of data
itself becomes part of the research problem.
Loukides (2012)
Big data sizes may vary per discipline.

Big Data and Social Webometrics Network
Analysis
Increasing data size in terms of the no. of nodes
Micro         ≦100 nodes     → 10K
Meso          ≦1000 nodes    → 1000K
Macro         ≦10000 nodes   → 100,000K
Super-Macro   ≥10000 nodes   → ∽
Source: Park, H. W. (2014)
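
The right-hand figures in the size table are consistent with n × n node pairs (100² = 10K, 1000² = 1000K, 10000² = 100,000K). The slide does not state this, so that reading is an assumption, and the function name below is hypothetical. A minimal Python sketch of the arithmetic:

```python
def possible_ties(n_nodes: int) -> int:
    """Cells in an n x n adjacency matrix (assumed reading of the slide's arrows)."""
    return n_nodes * n_nodes


# Node-count thresholds taken from the slide's size classes.
for label, n in [("Micro", 100), ("Meso", 1_000), ("Macro", 10_000)]:
    print(f"{label}: {n} nodes -> {possible_ties(n) // 1000}K possible ties")
```

The quadratic growth is one way to see why a tenfold increase in nodes moves an analysis into a different size class.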

http://www.clickz.com/clickz/news/23369

Data Insights: New Ways to Visualize and Make Sense of
Data, 2012, by Hunter Whitney

http://www.amazon.com/Data-Insights-Ways-Visualize-Sense/dp/0123877938

http://www.slideshare.net/MartinKaltenboeck/introduction-open-government-data
http://www.youtube.com/watch?v=ga1aSJXCFe0&feature=player_embedded

http://home.jtbc.joins.com/Vod/VodView.aspx?epis_id=EP10021807

Seoul builds late-night bus routes using big data

“Data Science” refers to “a discipline that incorporates
varying elements and builds on techniques and theories
from many fields, including data visualization with the goal
of extracting meaning from data and creating data
products.”
http://en.wikipedia.org/wiki/Data_science

Origin of Data Science
Park, H. W., & Leydesdorff, L. (2013 Work-In-Progress). Decomposing a Data-Driven Science Using a Scientometric Method.
• One origin is Peter Naur’s 1974 book “Concise Survey of Computer
Methods”, a survey of contemporary data processing methods in a wide
range of applications (Gilpress, 2012).
• The other is the first appearance of the term “big data” in the
Scopus database, in 1970 (Halevi and Moed, 2012). There has been no
particular key milestone since the 1970s.
• During the 1990s, the term was usually related to
computer modeling and software development for large datasets.
Knowledge Discovery and Data Mining in 1997. Rousseau (2012) also
regards the 1993 publication as the first document indexed in the Web
version of Web of Science.

A more recent development was made with the
establishment of journals that included the term “Data Science”
in their titles:
•Data Science Journal in 2002
•Journal of Data Science in 2003
•EPJ Data Science in 2012
•GigaScience gigasciencejournal.com in 2012
•Big Data & Society in 2015

http://bigdatasoc.blogspot.kr/2014/11/celebrating-official-launch-of-big-data.html?spref=fb

http://bigdatasoc.blogspot.co.uk/

Science published a special
issue (February 11, 2011) looking
broadly at increasingly data-driven
research efforts as a scientific
domain (Science staff, 2011).
Data Science is composed of interrelated
clusters of research tasks. For example, the
technologies on data collection, curation,
and access, and the unique skill sets have
increasingly been central to Data Science
(Science staff, 2011).

An international conference called “Data Science
Summit” (http://www.greenplum.com/datasciencesummit).

Re-cited from http://novaspivack.typepad.com/nova_spivacks_weblog/2007/02/steps_towards_a.html

All models are wrong but some are useful
Emergence of data author on dataverse

Anderson’s claims
• Data is everything we need.
• We don't have to settle for models.
• Agnostic statistics.
• Out with every theory of human behavior.
• This approach to science (hypothesize, model,
test) is becoming obsolete.
• Petabytes allow us to say: "Correlation is
enough." We can stop looking for models.
• What can science learn from Google? E-Science.

Big data and the end of theory?
• Does big data have the answers? Maybe some, but not all, says
Mark Graham.
• In 2008, Chris Anderson, then editor of Wired, wrote a provocative
piece titled The End of Theory. Anderson was referring to the ways
that computers, algorithms, and big data can potentially generate
more insightful, useful, accurate, or true results than specialists or
domain experts who traditionally craft carefully targeted
hypotheses and research strategies.
• We may one day get to the point where sufficient quantities of big
data can be harvested to answer all of the social questions that
most concern us. I doubt it though. There will always be digital
divides; always be uneven data shadows; and always be biases in
how information and technology are used and produced.
• And so we shouldn't forget the important role of specialists to
contextualize and offer insights into what our data do, and maybe
more importantly, don't tell us.
http://www.guardian.co.uk/news/datablog/2012/mar/09/big-data-theory

Graham, M., Hale, S. A., & Gaffney, D. (2014). Where in the world are you?
Geolocation and language identification in Twitter. Professional
Geographer, 66(4). http://www.tandfonline.com/doi/abs/10.1080/00330124.2014.907699#.VGnmIvmsX0d
Number of geotagged tweets per country between 10 November and 16
December 2011.

Computational (Social) Science
• Focus on the methodological perspective based on
the use of new digital tools to manage the data
deluge.
• Development of e-science tools to automate
research process.
• Experimentation with new types of data
visualization.

http://participatorysociety.org/wiki/index.php?title=Online_Research

Why Data Science?
Savage and Burrows (2007, p.
886) lament, “Fifty years ago,
academic social scientists might
be seen as occupying the apex
of the – generally limited – social
science research ‘apparatus’.
Now they occupy an increasingly
marginal position in the huge
research infrastructure”.
Bonacich, P. (2004).
The Invasion of the Physicists. Social Networks 26(3): 285-288

http://bds.sagepub.com/content/1/1/2053951714540280.full

http://www.bbc.com/news/uk-22007058

This approach to science is attributed to the late Jim Gray,
one of the most influential computer scientists, at Microsoft.

http://www.oii.ox.ac.uk/research/projects/?id=98

Global Communication Team 2
Challenges of (big) data science
The end of theory: evidence-based management
Jeffrey Pfeffer, Robert I. Sutton (2006)
How companies can bolster performance and trump the
competition through evidence-based management, an
approach to decision-making and action that is driven by
hard facts rather than half-truths or hype.
· With the advent of big data, traditional scientific research methods are fading
· Data beyond the limits of human cognition (patterns, not facts)

http://www.datacenterknowledge.com/archives/2011/09/23/the-lessons-of-moneyball-for-big-data-analysis/
Common Biases in Data Analysis
It’s easy to develop
“affirmation bias,” DePodesta
said. “Once we’ve made up our
minds, we resist information
that doesn’t agree with our
conclusion,” he said.
A particular problem in
baseball is “appearance bias”
– the notion that some
athletes look more like great
baseball players than others.
It’s also an issue in
business, DePodesta said,
citing a data point from
Malcolm Gladwell on height and
business success. Gladwell
found that although just 3.9
percent of American males are
6-foot-2 or taller, about 30

The Signal and the Noise:
Why Most Predictions Fail but Some Don't. Nate Silver
I do not go as far as a Popper in asserting
that such theories are therefore unscientific
or that they lack any value. However, the fact
that the few theories we can test have produced
quite poor results suggests that many of the
ideas we haven’t tested are very wrong as well.
We are undoubtedly living with many delusions
that we do not even realize. page 15

OECD (2012). OECD Technology Foresight Forum 2012 - Harnessing data as a new source of growth: Big
data analytics and policies. OECD Headquarters, Paris, France 22 October 2012

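Making research information services scientific
in the era of big data and SNS
• Continued growth and interdisciplinary expansion of the
Scientometrics and Triple Helix fields
- Technometrics, Webometrics,
Informetrics
- Accelerating spread of user-driven open tools and global A&I
services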

Mike Thelwall: WA 2.0
http://lexiurl.wlv.ac.uk/index.html

Marc Smith: NodeXL
http://nodexl.codeplex.com/

Han Woo PARK
KrKWIC, WeboNaver, WeboDaum

Open data tools using ArcGIS, e.g., World Bank data
cool

The Coming of Triple Divide?
There are three main gaps I’d like to emphasize
in the present and future of the Big Data research
community:
1) Developing/Transitional VS
Developed/Advanced countries,
2) Researchers in academia VS Researchers in
the commercial sector,
3) Researchers with computational skills VS
Less computational scholars.

Method used           Developed          Developing         Mixed Region
                      Country/Region     Country/Region
                      N       %          N       %          N       %
Social-Informetrics   114     74.51      30      83.33      9       52.94
Scientometrics        28      18.30      6       16.67      8       47.06
Webometrics           11      7.19       0       0          0       0
Total                 153     100        36      100        17      100
No. of articles in each category of methods by
the developed/developing division
Skoric, M. M. (2013, Online First). The implications of big data for developing and
transitional economies: Extending the Triple Helix?. Scientometrics.
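
The percentage columns in the table can be checked from the article counts alone. The sketch below recomputes each method's share within its region column; the counts are copied from the table, while the dictionary layout and function name are illustrative choices, not part of the source.

```python
# Article counts (N) copied from the table above.
counts = {
    "Social-Informetrics": {"Developed": 114, "Developing": 30, "Mixed": 9},
    "Scientometrics": {"Developed": 28, "Developing": 6, "Mixed": 8},
    "Webometrics": {"Developed": 11, "Developing": 0, "Mixed": 0},
}


def column_percentages(table):
    """Share of each method within a region column, rounded to 2 decimals."""
    regions = list(next(iter(table.values())).keys())
    totals = {r: sum(row[r] for row in table.values()) for r in regions}
    return {
        method: {r: round(100 * row[r] / totals[r], 2) for r in regions}
        for method, row in table.items()
    }


shares = column_percentages(counts)
for method, row in shares.items():
    print(method, row)
```

Webometrics studies appear only in the developed-country column, which is the pattern the "triple divide" argument draws on.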

Number of “Big data” papers per year
Halevi, G., & Moed, H. F. (2012).

Rousseau (2012)
We performed a similar search in the WoS (TS=“Big data”) on October
2, 2012, leading to 142 articles. We removed the oldest one (1974) and
kept the 141 published during the period 1993-2012. Halevi and Moed
observed an over-exponential growth over the period 1970-2011, while
we found a growth curve that could best be described by a cubic
polynomial (R² = 0.963, with year 1992 = 0), which is illustrated in Fig. 1.
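
To make the cubic-growth claim concrete, here is a minimal sketch of a least-squares cubic fit and its R², using only the Python standard library. The yearly counts are SYNTHETIC placeholders (the WoS series is not reproduced on the slide), the function names are hypothetical, and the coding of years as 1..20 follows the quoted convention that 1992 = 0.

```python
def fit_polynomial(xs, ys, degree=3):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients [c0, c1, ..., c_degree] of c0 + c1*x + c2*x^2 + ...
    """
    size = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y].
    rows = [
        [sum(x ** (i + j) for x in xs) for j in range(size)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(size)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(size):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][size] / rows[i][i] for i in range(size)]


def r_squared(xs, ys, coeffs):
    """Coefficient of determination of the fitted polynomial."""
    mean_y = sum(ys) / len(ys)
    predicted = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# SYNTHETIC series: a cubic trend with a small alternating disturbance.
years = list(range(1, 21))  # 1993..2012 coded with 1992 = 0
papers = [0.02 * t ** 3 + (1 if t % 2 else -1) for t in years]

coeffs = fit_polynomial(years, papers)
fit_quality = r_squared(years, papers, coeffs)
print(coeffs, fit_quality)
```

With real counts, comparing this R² against that of an exponential fit would be one way to examine the difference between the two studies' growth descriptions.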

Subject areas researching Big Data
Halevi, G., & Moed, H. F. (2012).

Geographical Distribution of Big Data papers
Halevi, G., & Moed, H. F. (2012).

Phrase map of highly occurring keywords 1999-2005
Halevi, G., & Moed, H. F. (2012).

Phrase map of highly occurring keywords 2006-2012
Halevi, G., & Moed, H. F. (2012).

• However, Halevi and Moed (2012) and Rousseau (2012) are
based on descriptive statistics. We therefore add a network
perspective, covering both social networks (in terms
of co-authorship) and semantic networks.
• Furthermore, we extend the search queries to various
terminologies related to Data Science, because the term
“big data” is regarded as only one among a list of policy
priority issues.
• We show where the research system in Data Science is
“hot” in terms of international collaborations and
prevailing semantics.
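
As an illustration of the network perspective on co-authorship, the sketch below turns per-paper country lists into weighted collaboration ties. The records are HYPOTHETICAL and the function name is illustrative; this is not the authors' actual pipeline, only one common way to build such a network.

```python
from collections import Counter
from itertools import combinations


def coauthorship_edges(papers):
    """Count how often each pair of countries appears on the same paper."""
    edges = Counter()
    for countries in papers:
        # sorted(set(...)) gives one undirected tie per pair per paper.
        for a, b in combinations(sorted(set(countries)), 2):
            edges[(a, b)] += 1
    return edges


# HYPOTHETICAL author-country lists, one list per paper.
papers = [["US", "KR"], ["US", "KR", "NL"], ["NL", "UK"], ["US"]]
edges = coauthorship_edges(papers)

# Weighted degree: total number of collaboration ties per country.
degree = Counter()
for (a, b), weight in edges.items():
    degree[a] += weight
    degree[b] += weight

print(edges.most_common())
print(degree.most_common())
```

The same pair-counting applies to a semantic network if the per-paper lists hold title words instead of countries.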

Park, H. W., & Leydesdorff, L. (2013). Decomposing Social and Semantic Networks in Emerging
“Big Data” Research. Journal of Informetrics, 7(3), 756-765.

http://graphics.wsj.com/house-midterm-elections-facebook/

Economics in the age of big data
http://www.sciencemag.org/content/346/6210/1243089.full

The rise of empirical economics
• Finally, data come with less structure. Economists
are used to working with “rectangular” data,
with N observations and K << N variables per
observation and a relatively simple dependence
structure between the observations. New data sets
often have higher dimensionality and less-clear
structure. For example, Internet browsing histories
contain a great deal of information about a person’s
interests and beliefs and how they evolve over time.
But how can one extract this information? The data
record a sequence of events that can be organized in
an enormous number of ways, which may or may not be
clearly linked and from which an almost unlimited
number of variables can be created. Figuring out how
to organize and reduce the dimensionality of large-
scale, unstructured data is becoming a crucial
challenge in empirical economic research.
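
One simple way to picture the step from an event sequence to "rectangular" data is to count events per person and category, giving N rows and K columns. The sketch below assumes HYPOTHETICAL browsing events already labeled with a category; the function name and the choice of columns are illustrative, and choosing the columns is exactly the open-ended part the passage describes.

```python
from collections import Counter, defaultdict


def to_rectangular(events, columns):
    """Turn (user, category) events into {user: [count per column]}."""
    per_user = defaultdict(Counter)
    for user, category in events:
        per_user[user][category] += 1
    # A Counter returns 0 for categories a user never visited.
    return {user: [counts[c] for c in columns] for user, counts in per_user.items()}


# HYPOTHETICAL browsing log: one tuple per page view.
events = [("u1", "news"), ("u1", "news"), ("u1", "shopping"), ("u2", "sports")]
columns = ["news", "shopping", "sports"]  # the K variables chosen by the analyst

matrix = to_rectangular(events, columns)
print(matrix)
```

A different choice of columns (time of day, sequence of categories, site-level counts) would yield a different rectangle from the same log.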

Using Big Data to Fight Range
Anxiety in Electric Vehicles
• The software acquires data from five sources: Google Maps (for route,
terrain, and traffic data), Wunderground.com (for weather), driver
history (through driving behavior measurements), vehicle manufacturers
(for vehicle modeling data), and battery manufacturers (for battery
modeling data).
http://spectrum.ieee.org/cars-that-think/transportation/sensors/using-big-data-to-fight-range-anxiety-in-electric-vehicles

Bi-linked network of politically active
A-list Korean citizen blogs (July 2005)
Korean political power bloggers and National Assembly members, 2005
URI=Centre
DLP=Left
GNP=Right
Just A-list blogs exchanging links with politicians

Affiliation network using pages linked to Lee’s and Park’s sites
Internet network of the campaign sites of candidates Lee Myung-bak and Park Geun-hye
N = 901 (Lee: 215, Park: 692, Shared: 6)

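Using e-research tools: web visibility analysis
• Candidates’ level of web visibility in the blogosphere was closely
correlated with the number of votes they received (Lim & Park, 2010,
JKDAS).
[Figure: actual number of votes vs. average number of blogs]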
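
A correlation of this kind can be computed as a Pearson coefficient between blog counts and vote counts. The sketch below uses HYPOTHETICAL numbers (the study's data are not on the slide), and the source does not say which correlation measure was used, so Pearson's r is an assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL candidates: blog posts mentioning each one, and votes received.
blog_counts = [120, 45, 300, 80, 15]
votes = [31000, 12000, 52000, 20000, 9000]

r = pearson_r(blog_counts, votes)
print(round(r, 3))
```

A high r here says only that visibility and votes move together; it does not by itself show that blog visibility drives votes.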

Results of the 28 October 2009 by-elections
- All winners had high blog visibility

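I. Characteristics and influence of social media
Case of the 26 October by-election
(2)
• A name network was built and analyzed from names mentioned
together on Facebook
• Early on, the two candidates were mentioned about equally often;
by the middle of the campaign, Park Won-soon and his supporters
were being mentioned and supporters of candidate Na Kyung-won
faded from view; by the end, the network had reorganized around
Park Won-soon.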

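I. Semantic network: comparing centrality in the semantic network
Case of the 26 October by-election
(2)
• Frequency analysis of the words appearing in
messages about the Seoul mayoral election
• From the start, candidate Na Kyung-won's frequency
declined; except when she appears late in the campaign
in talk of the contest with candidate Park Won-soon and
of the election result, she stays on the periphery of the
discourse throughout
• The Ahn Cheol-soo effect was strong early on and
faded from the middle of the campaign, but frequent
mentions of the Grand National Party revealed
sentiment against the ruling party, which indicates
the character of the election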
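
The word-frequency analysis described above can be sketched as counting, per campaign phase, how many messages mention each tracked term. The messages and term labels below are HYPOTHETICAL placeholders, and whitespace tokenization is a simplifying assumption; real Korean text would need a morphological tokenizer.

```python
from collections import Counter


def term_frequencies(messages, terms):
    """Count how many messages mention each tracked term (whitespace tokens)."""
    counts = Counter({term: 0 for term in terms})
    for text in messages:
        tokens = set(text.lower().split())
        for term in terms:
            if term in tokens:
                counts[term] += 1
    return counts


# HYPOTHETICAL messages grouped by campaign phase (placeholders, not real posts).
messages_by_phase = {
    "early": ["cand_a meets cand_b", "party_x backs cand_b", "cand_a rally"],
    "late": ["cand_a wins", "cand_a thanks voters", "cand_b concedes to cand_a"],
}
tracked = ["cand_a", "cand_b", "party_x"]

trend = {
    phase: term_frequencies(msgs, tracked)
    for phase, msgs in messages_by_phase.items()
}
for phase, counts in trend.items():
    print(phase, dict(counts))
```

Comparing the counts across phases gives the kind of early/middle/late trajectory the slide summarizes for each candidate and party name.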

• Figure 4. T Values for Bilateral Relationships between Park and Moon
• Triple Helix index values between candidates Park Geun-hye and Moon Jae-in
on Twitter, Facebook, and Google
19th presidential election
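
In the Triple Helix literature, a bilateral T value is commonly the mutual information between two dimensions, T_xy = H_x + H_y - H_xy, where H is Shannon entropy. The slide does not show how its values were computed, so the sketch below is an assumption: it derives T_xy from a 2 × 2 table of HYPOTHETICAL document counts (mentioning both names, only one, or neither), and the function names are illustrative.

```python
from math import log2


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def bilateral_t(both, only_x, only_y, neither):
    """Mutual information T_xy = H_x + H_y - H_xy from a 2 x 2 count table."""
    total = both + only_x + only_y + neither
    joint = [c / total for c in (both, only_x, only_y, neither)]
    p_x = (both + only_x) / total  # share of documents mentioning x
    p_y = (both + only_y) / total  # share of documents mentioning y
    h_x = entropy([p_x, 1 - p_x])
    h_y = entropy([p_y, 1 - p_y])
    h_xy = entropy(joint)
    return h_x + h_y - h_xy


# HYPOTHETICAL counts for one platform.
t_value = bilateral_t(both=40, only_x=30, only_y=20, neither=10)
print(round(t_value, 4))
```

T_xy is zero when mentions of the two names are statistically independent and grows as they co-occur (or avoid each other) more systematically.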

http://www.dt.co.kr/contents.html?article_no=2014121702100960718001

http://www.yeongnam.com/mnews/newsview.do?mode=newsView&newskey=20140604.010060719390001

http://news.chosun.com/site/data/html_dir/2011/05/11/2011051100195.html?news_topR

Yet, there still are serious problems to overcome. A
trenchant critique concerning the big data field as it is
nowadays came in the form of six statements intending to
temper unbridled enthusiasm [42]. These six provocative
statements are:
• Big data change the definition of knowledge;
• Claims to accuracy and objectivity are misleading;
• More data are not always better data;
• Taken out of context, big data loses its meaning;
• Just because it is accessible, it does not make it ethical; and
• (Limited) access to big data creates a new digital divide.
Rousseau (2012)

Big Data's Slippery Issue of
Causation vs. Correlation

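Park, H. W. (2012, July). The practice and challenges of social opinion
polling: Reading public opinion through ‘low-cost, high-efficiency’ SNS.
Monthly Newspaper and Broadcasting, pp. 84-88.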

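Daegu City's (proposed) reorganization and the open data
economy
• According to a survey by New York University, 500 open data
companies operate in the United States, and two-thirds of them
were founded within the last five years.
• The real estate company Zillow is a good example. It provides
an online marketplace that promotes the search and sharing of
essential information for homeowners, buyers, sellers,
landlords, brokers, lenders, landowners, and appraisers.
• Entering just a ZIP code gives access to real-estate-related
information such as school districts and safety. The service is
built on data for more than 110 million US homes and has had
the effect of circulating more than USD 3 billion of capital in
the market.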

http://www.opendata500.com/us/list/

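Emergence of negative views on big data
- The value of big data
- Limits exist in storage, analysis, and interpretation technologies
- The current boom is partly hype
Big data gap: Promise VS Capabilities
Challenges of big data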

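Challenges of big data: a big data ‘Gap’ analysis case
· A big data gap survey was conducted with 151 federal government CIOs and IT managers.
· Few agencies are actually making proper use of their current data, and
data ownership issues have not been settled.
[‘Meritalk’, a US government IT network, finds that a gap
exists between the promise and the reality of big data]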

http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/

Do we know what experiments are being conducted?
http://www.nature.com/news/facebook-experiment-boosts-us-voter-turnout-1.11401

We gave our consent without fully understanding it

User Content VS Site Content
Most SNS services have “Site Content” provisions
that render “User Content” powerless (p. 60).

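3. Conclusions and implications
Careful examination of technological and sociocultural factors
- Discussions of big data and AI never fail to raise dysfunctions such as
personal data leaks and invasion of privacy
- Alongside technological progress, a clear understanding of the future
we want, and a fundamental search for the political and social
foundations needed to achieve it
Professor Han Woo Park cited an incident that took place in the
United States in February 2012. Two British university students
wrote on Twitter, while entering the United States, that they would
blow up Los Angeles airport, and this was detected by the US
government. Park explained: “In this case the government regulated
not Twitter as a whole but the person who posted on Twitter and
what was posted, yet it grew into the issue of the US government
routinely looking into Twitter.”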

The Guardian's 10 commandments of social data journalism
• It may be trendy but it’s not new
• Open data means open data journalism
• Has data journalism become curation?
• Bigger datasets, smaller things
• Data journalism is 80% perspiration, 10% great idea, 10% output
• Long and short-form
• Anyone can do it…
• … but looks can be everything
• You don’t have to be a programmer
• It’s (still) all about stories
http://www.guardian.co.uk/news/datablog/2011/jul/28/data-journalism

Prof. Han Woo PARK
World Class University Webometrics Institute
CyberEmotions Research Center
Department of Media and Communication,
YeungNam University, Korea
hanpark@ynu.ac.kr www.hanpark.net
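Thanks to the researchers of the CyberEmotions Research Center and the
students of my undergraduate and graduate courses who helped prepare these slides.
These slides are private material made for personal purposes.
Distribution and copying are prohibited.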
