4. Technology is changing how we structure work
Then
Notepads, notebooks
Books
CD, DVD, iTunes
Focus groups
Classrooms/lectures
Now
Evernote
eBooks
Spotify, YouTube (streamed)
Affectiva
Khan Academy
7. Changing Policy
Emerging trend of journals and publishers linking to open-access data repositories
Journals and funding agencies setting policy to preserve
and associate data supporting research results
Open Access
8. Changing Global Research Patterns
The center of gravity of the world system of scholarship
is moving from west to east.
NOTES: Asia-10 includes China, Japan, India, Indonesia, Malaysia, Philippines, Singapore, South Korea, Taiwan, and Thailand.
SOURCE: National Science Board, Science and Engineering Indicators 2010
9. Citations of U.S. research articles in non-U.S.
literature, by region/country: 1998–2010
Asia-8 = India, Indonesia, Malaysia, Philippines, Singapore, South Korea, Taiwan, Thailand; EU = European Union
10. Share of citations to
international literature: 2000–10
Asia-8 = India, Indonesia, Malaysia, Philippines, Singapore, South Korea, Taiwan, Thailand; EU = European Union
11. Citations from Asia 10 Articles
NOTES: Asia-10 includes China, Japan, India, Indonesia, Malaysia, Philippines,
Singapore, South Korea, Taiwan, and Thailand. Asia-8 excludes China and Japan.
SOURCE: National Science Board, Science and Engineering Indicators 2010
12. US Academic Expenditures on Research by
Area, FY2008
(Millions of current dollars)
SOURCE: National Science Foundation/Division of Science Resources Statistics, Survey of Research and Development Expenditures at Universities and Colleges: FY 2008.
13. Master’s degrees conferred
Modern Languages: 1%
Other: 12%
Biological sciences: 2%
Visual/Performing Arts: 2%
Computer science: 3%
Education: 29%
Social Sciences: 3%
Psychology: 5%
Public Administration: 5%
Engineering: 5%
Health profession: 9%
Business: 25%
17. Academic work is social
2006 Univ. of Minn. Study
68% - Faculty work collaboratively
52% - Collaborate with colleagues at other institutions
46% - Find the distance from colleagues is a collaboration obstacle
“One group of experts can’t do everything”
SOURCE: Newman M E J PNAS 2001;98:404-409
18. Sharing
Low threshold (at good enough)
Astrophysicists: arXiv
Political scientists: SSRN
Economists: SSRN, NBER
High threshold (competitive)
Historians
Molecular and cell biologists
Archaeologists
Biochemists
Performers/Composers
Junior faculty in all fields are especially cautious for fear of theft and/or misinterpretation.
20. Changing Environment of Research
The “center” of research is shifting from libraries to other sources
[Figure: three panels (Past, Present, Future). Panel labels: Book Aggregation; Book & Journal Aggregation; Electronic Information Aggregation. Each panel shows LIBRARY and Other sources. Additional labels: Books on shelves; Databases.]
24. Context
Activities are the
context for when our
content is used in
research
Research mash-ups
“whole is greater than
the parts”
Critical for ecosystem
of research
25. Research Network Connections
Era of connections
Social networks
Professional networks
Connected information
Connected concepts
Connected meaning
30. Evolution of search
[Figure: evolution of search, with a “We are here” marker.
Systems: Catalog/index; Database/Search Engine; XPath/XQuery.
Search Technology: subject indexing of objects; full text search; A&I search; semantic and machine search; structural and network search.
Information Structure: hand-crafted metadata; content text; content XML.]
31. Search features
Going from metadata about objects and text search to
ideas, context and mining
Semantic Search
Greater granularity of discovery
Structural analysis of content
Precision search
Semantic search
Internationalization of search
Translated search
32. Semantic Search
Semantic Search utilizes robust data structures like ontologies to apply domain knowledge to otherwise two-dimensional terms.
The application of word context provides a dynamic
aspect to semantic search, allowing the user’s real-time
intent to guide results.
Contrast with static thesauri and controlled vocabularies
which miss nuances of context and intent.
33. Automated processing of library
content
PubMed contains ~17,787,763 articles to date
Manually searching is tedious and frustrating
Links between data and articles can be hard to find
Conclusion? Machines will be reading the library.
Using myExperiment workflows, researcher Paul Fisher found a link between cholesterol, patient trauma and parasite resistance in cattle.
34. Arrowsmith LBD: the ABC Model
Articles about an AB relationship; articles about a BC relationship
A: Raynaud’s syndrome
B: blood viscosity, etc.
C: dietary fish oil
AB and BC are complementary but disjoint: they can reveal an implicit relationship between A and C in the absence of any explicit relation.
The researcher assesses titles in the B literature identified by the system for fit or contribution to the problem.
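The ABC model above can be sketched in a few lines. This is a minimal illustration, not Arrowsmith's actual implementation: the corpus is a hypothetical toy set in which each article is reduced to the concepts it mentions, and the names `ARTICLES`, `abc_candidates`, and `rank` are invented for this example.

```python
from collections import defaultdict

# Hypothetical toy corpus (not real bibliographic data): article id -> concepts.
ARTICLES = {
    "a1": {"raynaud's syndrome", "blood viscosity"},
    "a2": {"raynaud's syndrome", "platelet aggregation"},
    "a3": {"dietary fish oil", "blood viscosity"},
    "a4": {"dietary fish oil", "platelet aggregation"},
    "a5": {"aspirin", "platelet aggregation"},
}


def abc_candidates(a_term, articles):
    """Return {C term: set of linking B terms} for a starting concept A.

    B terms are concepts that co-occur with A in some article (the AB
    literature). C terms are concepts that co-occur with a B term in
    articles that never mention A, and that never co-occur with A
    themselves (the disjoint BC literature).
    """
    a_docs = [terms for terms in articles.values() if a_term in terms]
    other_docs = [terms for terms in articles.values() if a_term not in terms]
    b_terms = set().union(*a_docs) - {a_term}

    candidates = defaultdict(set)
    for terms in other_docs:
        for b in terms & b_terms:
            for c in terms - b_terms:
                candidates[c].add(b)
    return dict(candidates)


def rank(candidates):
    """Order C terms by how many distinct B terms link them to A."""
    return sorted(candidates.items(), key=lambda kv: (-len(kv[1]), kv[0]))


if __name__ == "__main__":
    for c, linking in rank(abc_candidates("raynaud's syndrome", ARTICLES)):
        print(c, "via", sorted(linking))
```

The ranking only proposes candidates. As the slide notes, the researcher still has to read the B literature and judge whether each link is meaningful.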
36. Purpose of Content Analytics
Consumptive use: informs, delivers ideas.
Analytical use: inspires or proves ideas. The true center of research.
They occupy different points in the scholarly information lifecycle.
37. What does Content Analytics do?
Collaborative man-machine exploration
highlights trends, clues or anomalies [visualization – leverages cognitive skills].
On-demand analysis
identifies and quantifies trends, relationships, concepts and correlations (tools: SEASR, NLTK, Autonomy, …).
Continuous analytics
generates new ‘facets’ or annotations for discovery [augments content].
Preserves value
Older content is read less [ii], but remains important for trend analysis and statistical significance [value shifts].
38. Examples of text as data
Changes in word sense (e.g. consumption (TB), moot, oratio [1]) and spelling (e.g. 18th C. ſ to s, *re to *er)
Bibliometrics and other usage analyses
Citation patterns
Institution vs. discipline
Author demographics
Pharma: Drug / Symptom correlation.
Biology: Species / date / location observations.
Social Sci: Work/life habits of undergrads based on access patterns at different institutions [usage data based]
…
39. Text Mining
Unstructured text to queryable data structures
WHY?
TOO MUCH TEXT TO HAND ANALYZE.
Improved discovery ( better ‘metadata’ )
Business Intelligence
e.g. content stats -> content acquisitions
Saleable datasets
E.g. Distribution of authors vs. disciplines vs. grants
End User research agendas
High-End : Custom (user specified) mining as a service
Simple : Visualization of results ( frequency / co-occurrence … )
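As a rough illustration of turning unstructured text into queryable data, the sketch below builds the two simplest structures named above, term frequency and co-occurrence counts. The stopword list, the sample documents, and the names `tokenize` and `mine` are assumptions made for this example; a real pipeline would use a proper tokenizer and a much larger corpus.

```python
import re
from collections import Counter
from itertools import combinations

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is", "for"}


def tokenize(text):
    """Lower-case the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def mine(docs):
    """Return (term frequency, document-level co-occurrence) counters."""
    freq = Counter()
    cooc = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        freq.update(tokens)
        # Count each unordered pair of distinct terms once per document.
        for pair in combinations(sorted(set(tokens)), 2):
            cooc[pair] += 1
    return freq, cooc


# Hypothetical sample documents.
DOCS = [
    "Fish oil reduces blood viscosity",
    "Blood viscosity and Raynaud syndrome",
    "Fish oil in the diet",
]

freq, cooc = mine(DOCS)

if __name__ == "__main__":
    print(freq.most_common(4))
    print(cooc.most_common(3))
```

Once the counts exist, they can be queried like any other data, for example `cooc[("blood", "viscosity")]` to ask how many documents mention both terms.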
40. Options for use of text mining
Many, many options – it is about capability
Generic ‘improvements’
disambiguation of people, places and events
concept labeling ( different terms, same ‘thing’)
Corpus specific
E.g. extract institutions from theses (demographics).
Discipline specific
E.g. taxon labeling in biology.
Discovery tools , e.g.
topic labeling (‘natural’ disciplines)
reading level (grade, UG, PG … )
Structural analysis ( important parts of doc )
Boilerplate vs. ‘meat’
42. Datasets: Factoids & point data
ca. 1.4M Faculty (50% full-time) in US HE, ~75M people enrolled in US HE
ca. 100k Faculty in UK HE
44% of Researchers use online (other people’s) datasets for their research
48% of Researchers use datasets > 1GB
10.8% store their data outside their institution (50% store it in their “lab”)
1 - 5% of datasets are formally moved into the curation process.
66% of faculty have requested other people’s data (and 49% of those got it).
26.5% have the expertise to analyze their own data.
80.3% do not have sufficient expertise to manage their own data.
Institutional storage costs ~ $600 / TB / year
58% is the annual increase in the amount of data being generated
20-40% is annual growth in the amount of storage deployed (est.)
< 1% of ecological data is accessible after publication.
> 85% of all information is in text form
2.7 times more citations accrue to papers with accessible data
3 to 6 times more papers emerge if the data is accessible.
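The two growth figures above (58% annual growth in data generated, 20-40% annual growth in storage deployed) imply a widening gap. The sketch below works through that arithmetic under two loudly stated assumptions that are not in the slide: both quantities start equal, and both rates compound annually. The function name `growth_gap` is invented for this example.

```python
def growth_gap(years, data_growth=0.58, storage_growth=0.40):
    """Ratio of data generated to storage deployed after `years`.

    Assumes both start at the same size and grow at constant compound
    annual rates; the default rates are the figures quoted on the slide.
    """
    return ((1 + data_growth) / (1 + storage_growth)) ** years


if __name__ == "__main__":
    for storage_rate in (0.40, 0.20):
        ratio = growth_gap(5, storage_growth=storage_rate)
        print(f"storage growth {storage_rate:.0%}: "
              f"data is {ratio:.2f}x storage after 5 years")
```

Under these assumptions, after five years the data outgrows storage by a factor of roughly 1.8 at the optimistic 40% storage rate and roughly 4 at the 20% rate.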
43. Drivers of change
Nearly ubiquitous high-speed wireless globally
Inexpensive devices/apps/services
Global technology innovation
Policy shifts in Academia
Internationalization of scholarship
Growth in primary source datasets
Fearless and connected entrepreneurs
Fearless and connected researchers
2010 CSHE paper “Assessing the future landscape of scholarly communications”, Harley, Diane et al.
Comments: Nature 464, 466 (25 March 2010) http://www.nature.com/nature/journal/v464/n7288/full/464466a.html
A sample of what a very simple text analysis API can look like – plot the occurrence of ‘Malvinas’ over time. The subtlety is the linear decrease in interest after the post-war spike.
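A very simple API of the kind described in this note might look like the sketch below. The function name `occurrences_by_year` and the sample articles are hypothetical and invented purely to make the example runnable; they are not the data behind the slide.

```python
from collections import Counter


def occurrences_by_year(articles, term):
    """Count case-insensitive occurrences of `term` per publication year.

    `articles` is an iterable of (year, text) pairs. Years with no
    occurrences are kept with a count of 0 so gaps stay visible in a plot.
    """
    counts = Counter()
    needle = term.lower()
    for year, text in articles:
        counts[year] += text.lower().count(needle)
    return dict(sorted(counts.items()))


# Hypothetical sample data, for illustration only.
SAMPLE = [
    (1981, "Trade talks continue."),
    (1982, "Malvinas conflict begins. Malvinas debate in parliament."),
    (1983, "The Malvinas question remains open."),
    (1984, "Elections held."),
]

if __name__ == "__main__":
    # A text "plot": one '#' per occurrence.
    for year, n in occurrences_by_year(SAMPLE, "malvinas").items():
        print(year, "#" * n)
```

With counts this small, differences between years can easily be noise, so the output is a prompt for closer reading and not evidence in itself.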
Trying to make it very clear that datasets are a different and more central element of all scholarly research (with the possible exception of maths, philosophy and religion). Data both inspires and confirms ideas – text is mostly informative, rarely inspirational.
Highlights are the discussion points.
ii. In the physical sciences the half-life of content access frequency is ~6-8 years.
Grey text is a PQ business value, not a user value.
Whilst content can be obfuscated or reduced, there are thorny issues with usage data. Early policy decisions need to be taken with respect to exposing usage data, even indirectly (triangulation is always possible).
1. Oratio has shifted from ‘speech’ to ‘prayer’ and back again in the Latin literature. See Greg Crane et al.
Note that the number of articles is small anyway, so the data could simply be random variation. This is way too simple a tool for serious analysis.
Figures on faculty demographics from http://nces.ed.gov/programs/digest/d09
Sources in earlier paper on datasets.