Digital methods for Social Sciences: origin and definitions
1. DIGITAL METHODS
FOR SOCIAL SCIENCES
Marta Severo – Université de Lille 3, Laboratoire Gériico
marta.severo@univ-lille3.fr
9 August 2013, University of São Paulo, Escola de
Comunicações e Artes (ECA/USP)
2. PROGRAMME
Day 1: Digital methods: definitions & objects
Day 2: Scientometrics and network analysis
Day 3: Web mapping
Day 4: Collecting and analysing data
3. KEY QUESTIONS
1. Why do we analyze the data on the web?
2. Which data can we find?
3. Which objects can we study?
4. How to build a web-based corpus?
5. DIGITAL METHODS
A family of methods that share
a reliance on digital traces
as a source of information
for studying social
phenomena.
R. Rogers, "Internet Research: The Question of Method," Journal
of Information Technology and Politics, 7, 2/3, 2010, 241-260
7. THE WEB AS A SOURCE OF INFORMATION
Chris Harrison, 2007
Internet map (World City-to-City Connections)
8. THE RISE OF DIGITAL METHODS
Virtual reality
Late 1980s to early 1990s (Barlow, Turkle, Negroponte, Rheingold)
Virtual society?
1997-2002 (Steve Woolgar et al.)
Cultural analytics
2007 (Lev Manovich)
Digital methods
2009 (Richard Rogers)
9. DIGITAL METHODS
The issue no longer is how much of society and
culture is online, but rather how to detect
cultural change and societal conditions with the
Internet.
The conceptual point of departure for the
research programme is the recognition that the
Internet is not only an object of study, but also a
source.
R. Rogers, "Internet Research: The Question of Method," Journal
of Information Technology and Politics, 7, 2/3, 2010, 241-260
26. INTERNET IN NUMBERS
677 million active Web sites in the world
in 2012
In 2008, Google stated that its robots
had crawled 1 trillion URLs
In 2002, the count stood at 25 billion documents,
7.5 million new pages per day, 150
terabytes of information and 690 billion
pages on intranet sites.
29. SURFACE WEB VS INVISIBLE WEB
The surface Web is made up of all the
pages indexed by various search engines
The invisible web, or deep web, consists of
non-indexed pages. It is the hidden part of
the web. Few people know of its
existence, and yet it is a huge source of
information
30. WHY INVISIBLE
Documents, web pages, websites or
databases too large to be fully indexed (e.g.
Internet Movie Database, www.imdb.fr)
Pages protected from indexing (a robots meta
tag stops the crawler)
Dynamically generated pages
Pages protected by a login and password
Formats not read by search engines (Flash)
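
The "meta tag which stops the robot" case can be checked on any page you have already downloaded. The sketch below is only an illustration: it assumes the page uses the common robots meta tag with a noindex (or none) directive, and the sample page is hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for directive in attributes.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def is_hidden_from_search_engines(html):
    """Return True if the page asks crawlers not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives or "none" in parser.directives


# Hypothetical page used only for illustration.
sample_page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(is_hidden_from_search_engines(sample_page))
```

A page that passes this check may still be invisible for the other reasons listed above (login, dynamic generation, unreadable format).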
31. DATABASE
These resources are changing. A few
years ago they could be accessed only for a fee.
Today, more and more quality
information, particularly in
databases, is becoming free.
High-profile databases such as Lexis
Nexis, Dialog Datastar, Factiva, STN
International, Questel .... account for only just
over 1% of the deep web.
32. INTERNET ARCHIVE
The Internet Archive (http://archive.org/index.php)
is a digital library designed to archive the
digital documents of the internet in order to
save them from disappearing
entirely. The IA holds documents dating
back to 1996 (10 billion web pages, but also
Usenet, films and material from ARPANET, the internet's ancestor).
The Internet Wayback Machine (developed
with Alexa Internet) lets the user
find archived websites simply by typing a
URL and the desired date
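
The same "URL plus date" lookup can be done programmatically. The sketch below assumes the Internet Archive's public availability endpoint (archive.org/wayback/available), which is not mentioned in these slides; the sample response is hand-written in that endpoint's JSON shape and is not real archive data.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine availability service.
WAYBACK_ENDPOINT = "https://archive.org/wayback/available"


def build_wayback_query(url, date):
    """Build the availability query for a URL and a date (YYYYMMDD)."""
    return WAYBACK_ENDPOINT + "?" + urlencode({"url": url, "timestamp": date})


def closest_snapshot(response_text):
    """Return (snapshot_url, timestamp) from a JSON response, or None."""
    payload = json.loads(response_text)
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]


# Hand-written response for illustration only.
sample_response = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20060101000000/http://example.com/",
            "timestamp": "20060101000000",
            "status": "200",
        }
    }
})

print(build_wayback_query("example.com", "20060101"))
print(closest_snapshot(sample_response))
```

To query the live service, the string returned by build_wayback_query can be passed to urllib.request.urlopen and the body handed to closest_snapshot.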
33. GOOGLE NEWS ARCHIVE
https://news.google.com/news/advanced_news_search?as_drrb=a
Google News Archive lets you search
news archives spanning 200
years!
You can easily run keyword searches
across news from free and paid sources.
Google has agreements with
prestigious news sources such as Time, the
Wall Street Journal, the New York Times, the
BBC, the Guardian and The Washington Post
34. DIRECTORIES
Selective directories list professional
Internet resources selected on qualitative
criteria ...
Sites are selected by information
professionals to cover the areas of
university research and education in
general
Examples: http://aip.completeplanet.com/
http://www.ipl.org/ (maintained by
librarians)
35. PORTALS
Sites combining many resources
(articles, forums, news...) that can be
organized around a theme (vertical
portals).
http://www.enfin.com/
36. NEWS
Search engine news services, push
news, custom press releases, press
releases, news portals, daily newspaper
sites or business newspapers, directories
of national and international media, press
archives ...
37. NEWS ARCHIVE
Fee-based:
Factiva
Europresse
Free:
http://voxaleadnews.labs.exalead.com/
http://emm.newsexplorer.eu/
Pickanews www.pickanews.com
40. OPEN DATA
Open data is the idea that certain data
should be freely available to everyone to
use and republish as they wish, without
restrictions from copyright, patents or
other mechanisms of control.
Philosophy: data collected for the
public good must return to the public.
49. PUBLIC OR PRIVATE ENTERPRISES
RATP (public transport in Paris)
http://data.ratp.fr/
SNCF (train in France)
http://www.data.sncf.com/
Eau France (water in France)
http://www.services.eaufrance.fr/
52. HANS ROSLING SHOWS THE BEST
STATS YOU'VE EVER SEEN
http://www.ted.com/talks/lang/pt-br/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html
59. 1. ACTORS
Discourse mapping on the web:
From a corpus of web documents we can
trace the connections between the
authors of the documents by analysing
the occurrences or co-occurrences of
keywords
2 examples:
Media representations of the Mediterranean
solar plan (http://www.martasevero.com/?p=154)
Egyptians abroad
(http://www.e-diasporas.fr/working-papers/Severo&Zuolo-Egyptian-EN.pdf)
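
As a rough illustration of discourse mapping, the sketch below links two authors when their documents use the same keywords. It is a simplified assumption of how such a link could be computed, not the method used in the two studies cited; the corpus and the keyword list are invented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus: (author, text) pairs invented for illustration.
corpus = [
    ("Newspaper A", "The solar plan brings investment and energy to the region"),
    ("Newspaper B", "Investment in solar energy is the core of the plan"),
    ("NGO C", "Local communities question the desert land use"),
]

# Keywords chosen by the analyst; this choice is itself an assumption.
keywords = {"solar", "energy", "investment", "desert", "communities"}


def keywords_in(text):
    """Return the analyst's keywords that occur in a text."""
    return keywords & set(text.lower().split())


def author_links(documents):
    """Weight each pair of authors by the number of keywords they share."""
    used = {}
    for author, text in documents:
        used.setdefault(author, set()).update(keywords_in(text))
    links = Counter()
    for first, second in combinations(sorted(used), 2):
        shared = used[first] & used[second]
        if shared:
            links[(first, second)] = len(shared)
    return links


for pair, weight in author_links(corpus).items():
    print(pair, weight)
```

The weighted pairs can then be loaded into a network tool as an edge list, with authors as nodes.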
60. 2. CONNECTIONS - NETWORKS
Web mapping is based on the idea that
hyperlinks created on the web can be
used as a proxy for social ties (analysis
of the graph formed by the hyperlinks
within a set of web pages).
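
A minimal sketch of this idea, assuming the pages have already been downloaded: extract the hyperlinks, keep only links between different sites, and count how many sites point to each one. The page contents and site names are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def build_site_graph(pages):
    """Map each source site to the set of other sites it links to."""
    graph = defaultdict(set)
    for page_url, html in pages.items():
        source = urlparse(page_url).netloc
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.hrefs:
            target = urlparse(urljoin(page_url, href)).netloc
            if target and target != source:  # ignore internal links
                graph[source].add(target)
    return graph


def in_degree(graph):
    """Count how many distinct sites link to each site."""
    counts = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)


# Hypothetical pages, invented for illustration.
pages = {
    "http://museum.example/": '<a href="http://ministry.example/">Ministry</a> <a href="/about">About</a>',
    "http://ngo.example/": '<a href="http://ministry.example/heritage">Ministry</a>',
    "http://ministry.example/": '<a href="http://museum.example/">Museum</a>',
}

graph = build_site_graph(pages)
print(in_degree(graph))
```

Here a high in-degree is read as a sign that a site is recognised by the others, which is the "proxy for social ties" assumption stated above.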
61. EXAMPLES
M. Severo, T. Venturini, "Intangible Cultural Heritage Webs: Comparing national
networks with digital methods", in New Media & Society, forthcoming (pre-print
http://goo.gl/FPpTx)
66. MEME STUDY
A memetracker is a tool for studying the
migration of memes across a group of
people
A meme is "an idea, behavior, or style
that spreads from person to person within
a culture"
MemeTracker.org (Jure Leskovec) is a tool
celebrated for offering a new way of
examining media by watching how
quotes spread through professional and
citizen media
67. Leskovec, J., L. Backstrom, and J. Kleinberg. 2009. Meme-tracking
and the Dynamics of the News Cycle. In Proceedings of the 15th
ACM SIGKDD international conference on Knowledge discovery and
data mining, 497–506.
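
The idea of watching a quote spread can be sketched very simply by counting, per day and per type of media, the documents that contain a given phrase. This is a deliberately naive assumption and not the algorithm of Leskovec et al., which clusters variants of a phrase; the dated items below are invented.

```python
import re
from collections import Counter

# Hypothetical dated items: (date, media type, text), invented for illustration.
items = [
    ("2013-01-01", "news", 'The minister said "change is coming" at the rally.'),
    ("2013-01-01", "blog", "Everyone is quoting Change Is Coming today."),
    ("2013-01-02", "blog", "Still hearing that change  is coming..."),
    ("2013-01-02", "news", "Markets fell sharply on Tuesday."),
]


def normalise(text):
    """Lower-case a text and reduce punctuation and spacing to single spaces."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def phrase_timeline(documents, phrase):
    """Count, per (date, media type), the documents that contain the phrase."""
    needle = " " + normalise(phrase) + " "
    timeline = Counter()
    for date, media, text in documents:
        if needle in " " + normalise(text) + " ":
            timeline[(date, media)] += 1
    return timeline


for key, count in sorted(phrase_timeline(items, "change is coming").items()):
    print(key, count)
```

Plotting these counts over time for each media type gives a first view of whether a quote appears in professional or citizen media first.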
68. 3. EVENT STUDY
Just-in-time identification
of international media events:
the case of the Wukan’s protests
http://jitso.org/2012/12/02/the-wukans-protests-just-in-time-identification-of-international-media-events-revised/
70. First step: identify the goals of your
research (which objects?)
Second step: identify the data sources
(which data?)
Third step: define the exploratory
method for collecting and analysing data
71. FROM THE GOAL TO THE CORPUS
The analyst is one dimension of the
analysis. Their position must be made clear to give
readers of the analysis the keys to interpret it (context)
The selection criteria should be made explicit
Covering as many items as possible = open issue
Building the corpus as the research goes
on = issue focused on a specific point
Are there standards for preparing
the corpus, and defined research hypotheses
(which are gradually refined)?
72. PROBLEMS OF WEB DATA
Is the corpus exhaustive? (e.g. Factiva)
Is the corpus homogeneous?
Is the corpus representative?
Corpus Analysis
73. 3 DEGREES OF FREEDOM OF TREATMENT
Difference between the data and the
actual objects
Difference between the data and the
analysed data
Difference between the research output
and the interpretation of the researcher
or reader
74. TOOLS FOR ANALYSING THE CORPUS
Interpretation starts as early as the
cleaning of the corpus
We can transform data in the different
phases of the analysis
Visualisation can modify the data
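
A small sketch of why cleaning already involves interpretation: the same collected URLs give corpora of different sizes depending on the deduplication rule. Both rules and the URLs are hypothetical choices made for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical collected URLs, invented for illustration.
collected = [
    "http://example.org/article?id=1",
    "http://example.org/article?id=1&utm_source=feed",
    "http://EXAMPLE.org/article?id=1",
    "http://example.org/article?id=2",
]


def exact_key(url):
    """Rule A: two records are duplicates only if their URLs are identical."""
    return url


def normalised_key(url):
    """Rule B: ignore host case and drop tracking parameters (utm_*)."""
    parts = urlsplit(url)
    query = "&".join(
        p for p in parts.query.split("&") if p and not p.startswith("utm_")
    )
    return (parts.netloc.lower(), parts.path, query)


def corpus_size(urls, key):
    """Number of documents left after deduplication with the given rule."""
    return len({key(url) for url in urls})


print(corpus_size(collected, exact_key))
print(corpus_size(collected, normalised_key))
```

Neither rule is the correct one in general; the point is that the rule should be reported along with the results.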
75. THE CHOICE OF CORPUS IS NOT NEUTRAL
In the source of the corpus
In the format of documents
In the search query
We have a clear idea of the corpus to
extract