Digital methods for Social Sciences: origin and definitions

DIGITAL METHODS
FOR SOCIAL SCIENCES
Marta Severo – Université de Lille 3, Laboratoire Gériico
marta.severo@univ-lille3.fr
9 August 2013, University of São Paulo, Escola de
Comunicações e Artes (ECA/USP)

PROGRAMME
Day 1: Digital methods: definitions & objects
Day 2: Scientometrics and network analysis
Day 3: Web mapping
Day 4: Collecting and analysing data

KEY QUESTIONS
1.  Why do we analyze the data on the web?
2.  Which data can we find?
3.  Which objects can we study?
4.  How to build a web-based corpus?

WHY DO WE ANALYZE
THE DATA ON THE WEB?

DIGITAL METHODS
A family of methods that all rely
on digital traces as a source of
information for studying social
phenomena.
R. Rogers, "Internet Research: The Question of Method," Journal
of Information Technology and Politics, 7, 2/3, 2010, 241-260

THE WEB AS AN OBJECT OF STUDY
Photo credit – Brandon Doran via Flickr - ©

THE WEB AS A SOURCE OF INFORMATION
Chris Harrison, 2007
Internet map (World City-to-City Connections)

THE RISE OF DIGITAL METHODS
Virtual reality
Late 1980s to early 1990s (Barlow, Turkle, Negroponte, Rheingold)
Virtual society?
1997-2002 (Steve Woolgar et al.)
Cultural analytics
2007 (Lev Manovich)
Digital methods
2009 (Richard Rogers)

DIGITAL METHODS
 The issue no longer is how much of society and
culture is online, but rather how to detect
cultural change and societal conditions with the
Internet.
 The conceptual point of departure for the
research programme is the recognition that the
Internet is not only an object of study, but also a
source.
R. Rogers, "Internet Research: The Question of Method," Journal
of Information Technology and Politics, 7, 2/3, 2010, 241-260

GOOGLE PORTRAIT OF MARC L***
http://www.le-tigre.net/Marc-L.html

FACEBOOK FRIENDS NETWORK
Paul Butler, 2010, Visualizing Friendships

RICH DATA
AOL user 711391 search history
www.minimovies.org/documentaires/view/ilovealaska
By Lernert Engelberts and Sander Plug

LARGE POPULATIONS AND RICH DATA
Google Flu Trends www.google.org/flutrends

Ginsberg J. et al., "Detecting influenza epidemics using search
engine query data", Nature 457, 1012-1014 (19 February 2009)

http://www.google.org/flutrends/intl/pt_br/br/#BR

EPIDEMIOLOGY OF DISEASES
Dengue Trends www.google.org/denguetrends

EPIDEMIOLOGY OF RECIPES
 Thanksgiving Trends
http://www.nytimes.com/interactive/2009/11/26/
us/20091126-search-graphic.html

GOOGLE INSIGHTS FOR SEARCH AND
GOOGLE CORRELATE
google.com/trends
google.com/trends/correlate/
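
Search-based indicators such as the ones above are typically judged by how closely a query series follows an official series. A minimal sketch of that comparison, assuming Pearson correlation as the measure (the function name and any series passed to it are illustrative, not taken from Google's tools):

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return covariance / sqrt(var_x * var_y)
```

A value close to 1 only says that the two curves move together over the period observed; it does not say why, which is the point of the warnings that follow.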

(BE CAREFUL THOUGH)
  Askitas, N., & Zimmermann, K. (2011). Health and Well-Being
in the Crisis. IZA Discussion Paper

(BE CAREFUL THOUGH)
  http://googlesystem.blogspot.fr/2008/08/google-suggest-enabled-by-default.html

APOCALYPSE 2012
HTTP://WWW.YOUTUBE.COM/WATCH?V=QZBDSYWQNMC
 Maya documentary on the data….

INTERNET IN NUMBERS
 677 million active websites in the world
in 2012
 In 2008, Google stated that its robots
had crawled 1 trillion URLs
 In 2002, the count was 25 billion documents,
7.5 million new pages per day, 150
terabytes of information, 690 billion
pages on intranet sites.

Data on the web: Google, the invisible web, social networks, open data

SURFACE WEB VS INVISIBLE WEB
 The surface web is made up of all the
pages indexed by the various search engines
  The invisible web, or deep web, consists of
non-indexed pages. It is the hidden part of
the web. Few people know of its
existence, and yet it is a huge source of
information

WHY INVISIBLE
  Documents, web pages, websites or
databases too large to be fully indexed (e.g.
Internet Movie Database www.imdb.fr)
 Pages protected by copyright (a robots meta
tag stops the robot from indexing them)
 Dynamically generated pages
 Pages protected with login and password
 Formats not read by search engines (e.g. Flash)
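
The meta tag mentioned above can be checked when building a corpus. A minimal sketch, assuming the standard `<meta name="robots">` tag and treating `noindex` and `none` as "do not index"; the class and function names are illustrative:

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for directive in attributes.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def is_indexable(html: str) -> bool:
    """Return False when the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return not ({"noindex", "none"} & parser.directives)
```

This only covers the meta tag; a site can also exclude robots through its robots.txt file, which this sketch does not read.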

DATABASE
 These resources are changing. A few
years ago they could be accessed only for a fee
 Today, more and more quality
information, particularly in
databases, is becoming free.
 High-profile databases such as LexisNexis,
Dialog Datastar, Factiva, STN
International, Questel .... account for only just
over 1% of the deep web.

INTERNET ARCHIVE
 The Internet Archive (http://archive.org/index.php)
is a digital library designed to preserve the
digital documents of the internet and save
them from disappearing completely.
It holds documents dating back
to 1996 (10 billion web pages, but also
Usenet, movies and material from ARPANET, the internet's ancestor).
 The Internet Wayback Machine (developed
with Alexa Internet) allows the user
to find an archived website by simply typing its
URL and the desired date
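
The "URL plus date" lookup can also be written as a link. A minimal sketch that only builds the addresses and sends no request, assuming the public `web.archive.org/web/<timestamp>/<url>` pattern and the `archive.org/wayback/available` endpoint; the function names and the example.com address are placeholders:

```python
from datetime import date
from urllib.parse import urlencode


def wayback_snapshot_url(page_url: str, day: date) -> str:
    """Link to the archived copy closest to the given day."""
    timestamp = day.strftime("%Y%m%d")
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


def wayback_availability_query(page_url: str, day: date) -> str:
    """Query URL that asks whether a snapshot exists near the given day."""
    params = urlencode({"url": page_url, "timestamp": day.strftime("%Y%m%d")})
    return f"https://archive.org/wayback/available?{params}"


if __name__ == "__main__":
    print(wayback_snapshot_url("http://example.com/", date(2005, 3, 1)))
    print(wayback_availability_query("example.com", date(2005, 3, 1)))
```

Building the same link for a list of dates is a simple way to sample how a site changed over time.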

GOOGLE NEWS ARCHIVE
 Google News Archive
(https://news.google.com/news/advanced_news_search?as_drrb=a)
lets you search news archives
going back ..... 200
years!
 You can easily search by keyword
within news from free and paid sources.
Google has agreements with
prestigious news sources such as Time, the
Wall Street Journal, the New York Times, the
BBC, the Guardian and the Washington Post

DIRECTORIES
 Selective directories list professional
Internet resources selected on qualitative
criteria ...
 Sites are selected by information
professionals to cover the areas of
university research and education
in general
 Example: http://aip.completeplanet.com/
 http://www.ipl.org/ (maintained by
librarians)

PORTALS
 Sites combining many resources
(articles, forums, news...) that can be
organized around a theme (vertical
portals).
 http://www.enfin.com/

NEWS
 Search engine news services, push
news, custom press releases, press
releases, news portals, daily newspaper
sites or business newspapers, directories
of national and international media, press
archives ...

NEWS ARCHIVE
 For a fee:
  Factiva
  Europresse
 Free of charge:
  http://voxaleadnews.labs.exalead.com/
  http://emm.newsexplorer.eu/
  Pickanews www.pickanews.com

NEWS
 http://emm.newsexplorer.eu/

NEWS
 https://news.google.com/

OPEN DATA
 Open data is the idea that certain data
should be freely available to everyone to
use and republish as they wish, without
restrictions from copyright, patents or
other mechanisms of control.
 Philosophy: data collected for the
public good must return to the public.

SOME EXAMPLES
 International organisations
 Countries
 Cities
 General catalogues

FAO (HTTP://FAOSTAT3.FAO.ORG/)

WHO (HTTP://WHO.INT/RESEARCH/FR/INDEX.HTML)

WORLD BANK (HTTP://DATA.WORLDBANK.ORG/ )

REGIONAL DATABASE (HTTP://WWW.GOVERNOABERTO.SP.GOV.BR/)

PUBLIC OR PRIVATE ENTERPRISES
 RATP (public transport in Paris)
http://data.ratp.fr/
 SNCF (rail in France)
http://www.data.sncf.com/
 Eau France (water in France)
http://www.services.eaufrance.fr/

HTTP://INFOAMAZONIA.ORG/
  Gustavo Faleiros, Project Coordinator, Knight
International Fellow – O Eco

CATALOGUES
 http://www.data-publica.com/
 http://dashboard.opengovernmentdata.org/
  http://datacatalogs.org/
 http://www.quora.com/Data/What-is-the-most-comprehensive-list-of-international-open-government-datasets
 http://www.data.gov/opendatasites
 http://www.gapminder.org/
 http://www.statista.com (registration)

HANS ROSLING SHOWS THE BEST
STATS YOU'VE EVER SEEN
http://www.ted.com/talks/lang/pt-br/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html

WEB 2.0
 The classics: newsletters, newsgroups and
forums
 Facebook
 Professional networks
 Twitter
 Wikipedia
 Blogosphere
 ….

TYPE OF DATA
 Texts
 Images
 Video
 Audio
 …

DATA
Web pages
Documents
Blogs
Forums
Tweets
Facebook
Google search
Wiki …
OBJECTS
Actors
Connections
Events
Products
Sentiments
…

1. ACTORS
 Discourse mapping on the web:
from a corpus of web documents we can
trace the connections between the
authors of the documents by
analysing the occurrences or
co-occurrences of keywords
 Two examples:
  Media representations of the Mediterranean
solar plan (http://www.martasevero.com/?p=154)
  Egyptians abroad
(http://www.e-diasporas.fr/working-papers/Severo&Zuolo-Egyptian-EN.pdf)
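
The idea of connecting authors through keywords can be sketched in a few lines. This is a simplified illustration, not the method used in the two studies above: it assumes each document comes as an (author, text) pair, matches keywords as exact single words, and links two authors by the number of keywords they both use. All names and the sample data are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations


def keywords_by_author(documents, keywords):
    """Map each author to the set of keywords found in their documents."""
    wanted = {keyword.lower() for keyword in keywords}
    used = defaultdict(set)
    for author, text in documents:
        tokens = set(re.findall(r"\w+", text.lower()))
        used[author] |= tokens & wanted
    return used


def author_links(documents, keywords):
    """Weight each pair of authors by the number of keywords they share."""
    used = keywords_by_author(documents, keywords)
    links = {}
    for first, second in combinations(sorted(used), 2):
        shared = used[first] & used[second]
        if shared:
            links[(first, second)] = len(shared)
    return links


if __name__ == "__main__":
    # Hypothetical mini-corpus, for illustration only.
    sample = [
        ("Newspaper A", "The solar plan promises energy and jobs."),
        ("NGO B", "Solar energy needs public investment."),
        ("Blog C", "Jobs are the main issue."),
    ]
    print(author_links(sample, ["solar", "energy", "jobs", "investment"]))
```

The resulting weighted pairs can be loaded into a network tool as an edge list; the choice of keywords is itself an analytical decision.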

2. CONNECTIONS - NETWORKS
 Web mapping is based on the idea that
hyperlinks created on the web can be
used as a proxy for social ties (analysis
of the graph formed by the
hyperlinks among a set of web pages)
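
A minimal sketch of how such a graph can be built from pages already downloaded, assuming the corpus is a dictionary from page URL to HTML and that each website (host name) is one node; the function names are illustrative and no crawling is done here:

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def site_of(url):
    """Host name of a URL, used here as the node identifier."""
    return urlparse(url).netloc.lower()


def hyperlink_graph(pages):
    """Return {source site: set of other sites it links to}."""
    graph = defaultdict(set)
    for page_url, html in pages.items():
        parser = LinkParser()
        parser.feed(html)
        source = site_of(page_url)
        for href in parser.hrefs:
            # Resolve relative links; mailto: and similar have no host.
            target = site_of(urljoin(page_url, href))
            if target and target != source:
                graph[source].add(target)
    return dict(graph)


def in_degree(graph):
    """Count how many sites link to each site."""
    counts = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)
```

Internal links are dropped because only ties between sites are of interest; the in-degree gives a first, rough indication of which sites the others point to.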

EXAMPLES
M. Severo, T. Venturini, "Intangible Cultural Heritage Webs: Comparing national
networks with digital methods", in New Media & Society, forthcoming (pre-print
http://goo.gl/FPpTx)

MEME STUDY
 A memetracker is a tool for studying the
migration of memes across a group of
people
 A meme is "an idea, behavior, or style
that spreads from person to person within
a culture"
 MemeTracker.org (Jure Leskovec), a tool
celebrated for offering a new way of
examining media by watching how
quotes spread through professional and
citizen media

  Leskovec, J., L. Backstrom, and J. Kleinberg. 2009. Meme-tracking
and the Dynamics of the News Cycle. In Proceedings of the 15th
ACM SIGKDD international conference on Knowledge discovery and
data mining, 497–506.
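
To make the idea concrete, here is a much simplified sketch of quote tracking. It is not the MemeTracker algorithm, which also groups variants of the same phrase: it assumes dated documents given as (date, text) pairs, treats only text between straight double quotes as a quote, and counts exact repetitions per day. All names are illustrative.

```python
import re
from collections import Counter

QUOTE = re.compile(r'"([^"]+)"')


def quotes_in(text, min_words=3):
    """Return the quoted phrases of a text, lowercased, skipping very short ones."""
    return [
        phrase.strip().lower()
        for phrase in QUOTE.findall(text)
        if len(phrase.split()) >= min_words
    ]


def quote_timeline(documents, min_words=3):
    """Return {quote: Counter({date: number of documents quoting it})}."""
    timeline = {}
    for day, text in documents:
        # set() so that a document repeating a quote counts once.
        for quote in set(quotes_in(text, min_words)):
            timeline.setdefault(quote, Counter())[day] += 1
    return timeline
```

Plotting the counts per day for each quote gives a rough picture of how a phrase rises and fades across the sources in the corpus.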

3. EVENT STUDY
 Just-in-time identification
of international media events:
the case of the Wukan protests
http://jitso.org/2012/12/02/the-wukans-protests-just-in-time-identification-of-international-media-events-revised/

HOW TO BUILD A WEB-BASED CORPUS?

 First step: identify the goals of your
research (which objects?)
 Second step: identify the data sources
(which data?)
 Third step: define the exploratory
method for collecting and analysing data

FROM THE GOAL TO THE CORPUS
 The analyst is one dimension of the
analysis. Their position must be made explicit
so that readers can interpret the analysis (context)
 The selection criteria should be made explicit
  Covering as many items as possible = open issue
  Building the corpus as the research goes
on = issue focused on a specific point
 Are there standards for preparing
the corpus, such as predefined research hypotheses
(which are gradually refined)?

PROBLEMS OF WEB DATA
 Is the corpus exhaustive? (e.g. Factiva)
 Is the corpus homogeneous?
 Is the corpus representative?
Corpus Analysis

3 DEGREES OF FREEDOM OF TREATMENT
 Difference between the data and the
actual objects
 Difference between the data and the
analysed data
 Difference between the research output
and the interpretation of the researcher
or reader

TOOLS FOR ANALYSING THE CORPUS
 Interpretation begins as early as the
cleaning of the corpus
 Data can be transformed at each
phase of the analysis
 Visualisation can modify the data

THE CHOICE OF CORPUS IS NOT NEUTRAL
 In the source of the corpus
 In the format of documents
 In the search query
 In the prior idea we have of the corpus to
extract

DISCUSSION
 Your questions?
 Your objects?
 Your data?
