Sci Know Mine 2013: What can we learn from topic modeling on 350M academic documents?
1. What can we learn from topic
modeling on 350M documents?
William Gunn
Head of Academic Outreach
Mendeley
@mrgunn - https://orcid.org/0000-0002-3555-2054
2. Based in London, Mendeley is
researchers, graduates and software
developers from...
3. The opposite problem
• We have the papers (400M) and are
looking for the best way to turn
them into structured knowledge.
• We have useful triage indicators:
#altmetrics, reproducibility
• You have great use cases
4. ...and aggregates
data in the cloud
Mendeley
extracts
research data…
Collecting rich signals
from domain experts.
6. TEAM Project
academic knowledge management solutions
• Developing algorithms to determine the content similarity of academic papers
• Performing text disambiguation and entity recognition to
differentiate between and relate similar in-text entities and authors
of research papers.
• Developing semantic technologies and semantic web languages with
a focus on metadata integration/validation
• Investigating profiling and user analysis technologies, e.g. based on
search logs and document interaction.
• We will also improve folksonomies and, through that, ontologies of
text.
• Finally, tagging behaviour will be analysed to improve tag
recommendations and strategies.
• http://team-project.tugraz.at/blog/
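The slide does not say which similarity algorithms the TEAM project uses. As a minimal sketch of one common baseline, assuming papers are already tokenised, TF-IDF vectors can be compared with cosine similarity. The function names and the three toy abstracts below are hypothetical and are not taken from the project.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per tokenised document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF so that terms found in every document keep a non-zero weight.
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical toy abstracts, for illustration only.
abstracts = [
    "topic models for scholarly documents".split(),
    "topic models of academic documents".split(),
    "protein folding in yeast cells".split(),
]
vecs = tfidf_vectors(abstracts)
print(cosine(vecs[0], vecs[1]))  # shares three terms with the first abstract
print(cosine(vecs[0], vecs[2]))  # shares no terms with the first abstract
```

A bag-of-words baseline like this only sees shared surface terms, so it is a starting point rather than a measure of semantic similarity.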
7. Semantics vs. Syntax
• Language expresses semantics via syntax
• Syntax is all a computer sees in a research
article.
• How do we get to semantics?
• Topic Modeling!
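The talk does not name the topic model or the tooling used on the Mendeley corpus. As a rough sketch of the idea, assuming latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling, the following standard-library code infers per-document topic proportions from word co-occurrence alone. The function name, hyperparameters, and four toy documents are illustrative assumptions.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenised documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed topic proportions for each document.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word


# Hypothetical toy corpus, for illustration only.
docs = [
    "gene protein cell expression gene".split(),
    "protein cell gene pathway".split(),
    "algorithm graph complexity algorithm".split(),
    "graph algorithm runtime complexity".split(),
]
theta, topic_word = lda_gibbs(docs)
for row in theta:
    print([round(p, 2) for p in row])
```

The model never sees what the words mean. It only groups terms that tend to occur together, which is how topic modeling moves from syntax toward something closer to semantics.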
8. Distribution of Topics
[Bar chart: share of documents (0% to 35%) by discipline: Bio, Phys, Engineer, Comp Sci, Psych & Edu, Business, Law, Other]
18. Code Project
Use case = mining research papers for facts
to add to LOD repositories and light-weight
ontologies.
• Crowd-sourcing enabled semantic enrichment & integration
techniques for integrating facts contained in unstructured
information into the LOD cloud
• Federated, provenance-enabled querying methods for fact
discovery in LOD repositories
• Web-based visual analysis interfaces to support human based
analysis, integration and organisation of facts
• Socio-economic factors (roles, revenue-models and value
chains) realisable in the envisioned ecosystem.
• http://code-research.eu/
23. We didn't
see that a target is
more likely to be validated if it
was reported in ten publications
or in two publications
NATURE REVIEWS DRUG DISCOVERY 10, 712 (SEPTEMBER 2011)
24. Either the results were reproducible
and showed transferability in other
models, or even a 1:1 reproduction of
published experimental procedures
revealed inconsistencies between
published and in-house data
NATURE REVIEWS DRUG DISCOVERY 10, 712 (SEPTEMBER 2011)
25. There is no Gold Standard
• Amgen: 47 of 53 "landmark" oncology publications could
not be reproduced.
• Bayer: 43 of 67 oncology & cardiovascular projects were
based on contradictory results
• Dr. John Ioannidis: 432 publications purporting sex
differences in hypertension, multiple sclerosis, or lung
cancer. Only one data set was reproducible.
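To put the first two counts on a common scale, they can be converted to percentages. This is plain arithmetic on the figures quoted on the slide. The Ioannidis figure is left out because the slide mixes units there (publications versus data sets). The dictionary keys are labels chosen for illustration.

```python
# Counts as quoted on the slide: (affected, total).
counts = {
    "Amgen": (47, 53),  # landmark oncology publications that could not be reproduced
    "Bayer": (43, 67),  # projects based on contradictory results
}

rates = {name: affected / total for name, (affected, total) in counts.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
# Amgen: 88.7%
# Bayer: 64.2%
```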
26. Building a reproducibility dataset
• Mendeley and Science Exchange have
started the Reproducibility Initiative
• working with Figshare & PLOS to host data
& replication reports
• building open datasets backing high-impact
work
• extending the "executable paper" concept
to biomedical research
27. Make it porous & part of the
web.
• Our success as a crowdsourcing platform
is largely due to our openness & end-user
usefulness.
• Communities must be open if they are to
thrive.