Daphne Duin, David King and Peter van den Besselaar
Symposium of the Oxford Internet Institute: Social Science and Digital Research: Interdisciplinary Insights, 12 March 2012
‘Wish you were here before!’ Who Gains from Collaboration between Computer Science and Social Research?
1. ViBRANT
Virtual Biodiversity Research
‘Wish you were here before!’
Who gains from collaboration between computer science
and social research?
Daphne Duin, David King, Peter van den Besselaar
Dep. of Organization Sciences & Network Institute, VU-University Amsterdam
Department of Computing, The Open University, Milton Keynes
Social Science and Digital Research: Interdisciplinary Insights,
March 12, 2012, Oxford e-Research Centre
2. ViBRANT
Virtual Biodiversity Research
Help! How is this social data?
Time taken to serve the request (microseconds)
Host name (equates to Scratchpad)
"Full URL" (in quotes)
Origin of request (IP address)
F5
Time the request was received (e.g. 01/Apr/2011:11:17:42 +0100)
"First line of request" (in quotes)
Status of final request (e.g. 200, 301, etc.)
Size of the response in bytes
Remote logname (almost always blank)
"Referer" (in quotes)
able.myspecies.info http://able.myspecies.info/favicon.ico 24.218.227.223 -- [14/Jul/2010:19:54:06
GET /favicon.ico HTTP/1.1 200 198 - Mozilla/5.0 (Macintosh; U; Intel Mac OS X
10.6; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6
polychaetes.info http://polychaetes.info/node/add/forum/forum/ 24.229.196.151 --
[14/Jul/2010:20:16:48 GET /node/add/forum/forum/ HTTP/1.0 301 -
http://polychaetes.info/node/add/forum/forum/ Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x
4.90; Creative)
ciliateguide.myspecies.info http://ciliateguide.myspecies.info/node/add/forum/forum/ 24.229.196.151 --
[14/Jul/2010:20:39:14 GET /node/add/forum/forum/ HTTP/1.0 301 -
http://ciliateguide.myspecies.info/node/add/forum/forum/ Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
MRA 4.6 (build 01425); MRSPUTNIK 1, 5, 0, 19 SW)
ciliateguide.myspecies.info http://ciliateguide.myspecies.info/node/add/forum/forum 24.229.196.151 --
[14/Jul/2010:20:39:22 GET /node/add/forum/forum HTTP/1.0 200 25219
http://ciliateguide.myspecies.info/node/add/forum/forum Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
MRA 4.6 (build 01425); MRSPUTNIK 1, 5, 0, 19 SW)
ciliateguide.myspecies.info http://ciliateguide.myspecies.info/node/add/forum/forum 24.229.196.151 --
[14/Jul/2010:20:39:37 POST /node/add/forum/forum HTTP/1.0 200 27128
http://ciliateguide.myspecies.info/node/add/forum/forum Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
MRA 4.6 (build 01425); MRSPUTNIK 1, 5, 0, 19 SW)
ciliateguide.myspecies.info http://ciliateguide.myspecies.info/node/add/forum/forum 24.229.196.151 --
[14/Jul/2010:20:39:47 GET /node/add/forum/forum HTTP/1.0 200 25219
http://ciliateguide.myspecies.info/node/add/forum/forum Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
MRA 4.6 (build 01425); MRSPUTNIK 1, 5, 0, 19 SW)
26141 wallacefund.info http://wallacefund.info/robots.txt 38.101.148.126 --
[15/Jul/2010:03:48:42 GET /robots.txt HTTP/1.1 200 44 - Mozilla/5.0
(compatible; discobot/1.1; +http://discoveryengine.com/discobot.html
mhp.myspecies.info http://mhp.myspecies.info/robots.txt 38.101.148.126 -- [15/Jul/2010:03:48:49
GET /robots.txt HTTP/1.1 200 44 - Mozilla/5.0 (compatible; discobot/1.1; +
3. ViBRANT
Virtual Biodiversity Research
Interdisciplinary work for e-science
E-science
1. Application of an e-infrastructure to do science
2. The study of the design, uptake and use of e-Science
E-infrastructure: Scratchpads, online platform for
biodiversity research
Need: Developing alternative evaluation metrics for
e-science
Goal: Identification of different types of users
Approach: Collaboration between social science and
computer science valuable for e-science
4. ViBRANT
Virtual Biodiversity Research
What is the impact of e-science?
Question from e-science facility to social scientists
Identification of different types of users
Who is visiting the Scratchpads platform?
Web data (e.g. server log files)
Identify Internet Service Providers visiting
Scratchpads
Cluster Internet Service Providers visiting
Scratchpads, into meaningful categories
5. ViBRANT
Virtual Biodiversity Research
Material
Standard web analytics report of Scratchpads
> 300 community sites
> 5,000 registered users (unpaid)
Public and closed content
Names of 6,728 unique Internet Service Providers
(ISPs) (6 months)
natural history museum
telstra internet
verizon online llc
freie universitaet berlin
queensland department of natural resources and water
Gemeente maastricht
national parks board (ministry of national development)
agriculture and agrifood canada
Commission europeenne
u.s. fish and wildlife service irm/bfo hq
state of nebraska / office of
6. ViBRANT
Virtual Biodiversity Research
Social scientists and computer scientists
First trying alone…
….marina|marine|medical|medisch|microsoft|mineral|mining|ministerie|
ministry|monsanto|museo|museum|national park|naval|navy|nerc|news|
novartis|observatoire|office….
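A minimal sketch of how such an include filter could be applied, assuming the terms are joined into one case-insensitive alternation and matched as substrings of the ISP name. The term list is a small subset taken from the fragment above, and `is_specific_isp` is a hypothetical helper name, not the tool the authors used.

```python
import re

# A few include terms copied from the slide fragment; the manual filter
# described in this talk had 181 terms in total.
INCLUDE_TERMS = ["marine", "medical", "ministry", "museum",
                 "national park", "navy", "observatoire"]

# Build one case-insensitive "a|b|c" alternation, escaping each term.
INCLUDE_PATTERN = re.compile(
    "|".join(re.escape(term) for term in INCLUDE_TERMS), re.IGNORECASE
)

def is_specific_isp(name: str) -> bool:
    """Return True when the ISP name contains at least one include term."""
    return INCLUDE_PATTERN.search(name) is not None

# ISP names taken from the Material slide.
isps = ["natural history museum", "telstra internet", "verizon online llc"]
specific = [name for name in isps if is_specific_isp(name)]
print(specific)
```

Substring matching keeps the filter simple but also matches terms inside longer words, which is one reason a hand-built list needs many hours of checking.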
Then question to computer scientist
...from social scientists: could you help us to better...
• collect web data?
• refine/cluster the data?
• develop tools/methods for measuring robustness of
data?
7. ViBRANT
Virtual Biodiversity Research
Altmetrics for e-science: a social science and
computer science project
“To what extent can we improve a human-developed method
with computational techniques, in order to cluster ISPs into
meaningful categories representing the various audiences
using Scratchpads?”
8. ViBRANT
Virtual Biodiversity Research
Method computer scientist
Identify Internet Service Providers visiting
Scratchpads, removing noise
Inductive logic programming (Aleph)
Cluster Internet Service Providers visiting Scratchpads
into meaningful categories
Bayesian classifier
9. ViBRANT
Virtual Biodiversity Research
Results: Identification of ISPs
Manually built filter (181 terms)
- accuracy 94%
- precision 92%
- recall 97%
Many hours of work
Computational filter (6 terms)
- accuracy 84%
- precision 98%
- recall 73%
Couple of minutes
[Chart: Comparison of filters, 6 term filter set vs 181 term filter set; measures: precision, recall, f-measure]
10. ViBRANT
Virtual Biodiversity Research
Results: Clustering ISPs in meaningful categories
Manual method: filter with key words
“university” “research” “school” “museum”
Problematic!
Computational method: classifiers
- 90% accuracy
Couple of minutes!
[Chart: ISPs by Sector; categories: government, industry, media/arts, research/edu]
[Chart: Classifier Accuracy; legend: Simple Bayes; tiers: Sector, Level, Focus]
11. ViBRANT
Virtual Biodiversity Research
Who gains from collaboration between
computer science and social research?
• E-science facilities, e-science uptake and
implementation
• Social Science and
• Computer Science
12. ViBRANT
Virtual Biodiversity Research
Acknowledgments
ViBRANT –http://vbrant.eu
Scratchpads –http://scratchpads.eu/
Laura Hollink for her help with the raw log files
Simon Rycroft for his help with the web analytics reports
Vince Smith for sharing presentation material
Editor's notes
It all started with a 19GB file with this type of data, which was sent to the social scientists by an e-science facility with the question: “Please tell us who our users are and how they are using our infrastructure”. The file contained transaction logs from users visiting the infrastructure. E-science facilities generate electronic data. These are the digital footprints of users and usage that are stored in the logs of the e-infrastructure and tell us when the infrastructure is accessed and when information is downloaded, uploaded or edited. In other words, how users ‘behaved’. We learnt from this experience that this type of “electronic use” data is characterized by:
- Large data sets
- Fuzzy (ambiguous) data (example: uni spelled wrong)
Where did this question come from, and what motivated it? How does this link to different definitions of e-science?
We understand e-science as:
1. Application of an e-infrastructure to do science
2. The study of the design, uptake and use of e-science
As you can see, the expertise of both CS and SS is apparent in this definition of e-science. In this presentation we will demonstrate how a collaborative project that we set up contributes to the development of Scratchpads.
E-infrastructure: Scratchpads are an online platform for scientists, facilitating e-science in the field of biodiversity research.
Need: With more and more scientific work moving to the web or into databases, there is a growing need to understand the impact this change has on science, scientists and the users of scientific information.
Goal: Crucial in assessing such an impact is the identification of different types of users and use.
Approach: Interdisciplinary work of CS and SS; sophisticated data treatment is needed to give the data meaning in the context of evaluation.
So we had a question: who are the visitors of Scratchpads? And we had a file with electronic use data. The challenge then was how to analyse the data and to know how robust the data are.
Identify: We decided to start with identifying “the users”. Web analytics packages can be used to generate information on the visitors (users), notably through identification of the names of the visiting Internet Service Providers (ISPs). Through the name of the ISP (e.g. ‘Vrije Universiteit’) we may be able to identify the nature and activities of the users.
Cluster: In addition to identification, we also wanted to cluster the ISPs into categories that make sense for evaluation purposes. We were particularly interested in the split between academic users, other educational users, and sectors such as government and business, as this could tell us something about the (societal) impact of the e-infrastructure.
We now zoom in on the data we had access to. We left the raw log files for what they were and used a standard web analytics report; we made this decision after consulting a computer scientist. We are looking at 300 websites at once, which generates a long list of ISPs. The list contains ISPs that are clearly part of the biodiversity research community, government and funders. We call these ‘specific ISPs’ and the rest ‘general ISPs’. We filter the general ISPs out; the others are relevant for ViBRANT and for evaluation purposes.
First the social scientists tried to handle the data alone and manually developed a natural expression filter of 181 ‘include’ terms, which took many hours of work. We ran into several limits, both in what the system technically allowed us to do and in our own skills, and we could not work around them. We discussed our problems with David, and it turned out that the questions we had were also interesting questions for CS. This is how we decided to join forces and start a project together. David will now tell you something about the computer science contribution and the outcomes of our work.
The social scientists produced a 181-term filter set after many hours of effort that gave 94% accuracy, whereas the computer scientist produced a 6-term filter set in a couple of minutes that gave 84% accuracy. The computer-aided filter reached a higher precision than the manually developed filter (98% vs 92%), though recall in this initial test favored the manual approach (73% vs 97%).
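For readers less familiar with these measures, the sketch below shows how accuracy, precision, recall and the F-measure are computed from the counts of correct and incorrect filter decisions. The counts passed to `filter_metrics` are hypothetical and chosen only to make the arithmetic easy to follow; they are not the study's data. The last two lines apply the standard F-measure formula to the rounded precision and recall figures quoted above, so the results are only as exact as those rounded figures.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def filter_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute evaluation measures from confusion-matrix counts.

    tp: relevant ISPs kept, fp: irrelevant ISPs kept,
    fn: relevant ISPs dropped, tn: irrelevant ISPs dropped.
    Assumes every denominator is non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure(precision, recall),
    }

# Hypothetical counts, for illustration only.
example = filter_metrics(tp=40, fp=10, fn=20, tn=30)
print(example)

# F-measure implied by the rounded figures reported for the two filters.
f_manual = f_measure(0.92, 0.97)         # 181-term manual filter
f_computational = f_measure(0.98, 0.73)  # 6-term computational filter
print(round(f_manual, 2), round(f_computational, 2))
```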
Meaningful categories in this context are categories that … The manual process highlighted a problem with continuing to use keywords to categorize ISPs. Some categories are easily built from words in the full name of the ISP, such as “university” or “research”, which can be grouped under the tier-one category “research & education”. However, this approach is limited. For example, simply categorizing all ISPs with the terms “health” or “medic*” in their name as “public health” meant that a range of research, educational, governmental and corporate ISPs were wrongly classified. Therefore, we were encouraged to categorize ISPs using classifiers rather than by extending our work with filters.
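To illustrate the difference between a keyword filter and a classifier, here is a minimal sketch of a multinomial naive Bayes classifier over the words of an ISP name. It is an assumption-laden illustration, not the classifier used in the project: the class `NaiveBayes`, the whitespace tokenizer, the add-one smoothing and the sector labels attached to the example ISP names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokenize(name: str) -> list:
    """Lower-case the ISP name and split it on whitespace."""
    return name.lower().split()

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # names seen per sector
        self.word_counts = defaultdict(Counter)  # word counts per sector
        self.vocab = set()

    def fit(self, names, labels):
        for name, label in zip(names, labels):
            self.class_counts[label] += 1
            for word in tokenize(name):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, name):
        total_names = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_names in self.class_counts.items():
            score = math.log(n_names / total_names)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(name):
                if word not in self.vocab:
                    continue  # words never seen in training carry no evidence
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# ISP names from the Material slide; the sector labels are assumed here.
training = [
    ("natural history museum", "research/edu"),
    ("freie universitaet berlin", "research/edu"),
    ("queensland department of natural resources and water", "government"),
    ("gemeente maastricht", "government"),
    ("u.s. fish and wildlife service", "government"),
    ("agriculture and agrifood canada", "government"),
    ("verizon online llc", "industry"),
    ("telstra internet", "industry"),
]
model = NaiveBayes()
model.fit([name for name, _ in training], [label for _, label in training])

print(model.predict("history museum vienna"))
print(model.predict("department of fish"))
```

Unlike a single include term, the classifier weighs all the words in a name against each sector, so one ambiguous word such as “health” does not decide the category on its own.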
1. Interdisciplinary work of CS and SS will bring to e-science enhanced insights into the actual use of the e-science environment, based on robust (log) data and analysis, in a relatively short amount of time.
2. Social science will benefit from working with CS because of the increased scale and speed of data collection and analysis, and because of CS insight into the technological boundaries and characteristics.
3. CS will benefit because collaboration provides 1) an opportunity to demonstrate their engineering insights (tool building for the e-science facility as well as tools for analyzing social science data sets) and 2) access to large datasets with behavioral/user information, which are nice cases for testing computer science theories.
Possible costs: Above we listed several reasons for collaboration between e-science facilities, computer science and social science. Nevertheless, every collaboration has costs: it requires time for planning and communication. Furthermore, collaborators often support each other’s work at the cost of advancing their own research.