Sci Know Mine 2013: What can we learn from topic modeling on 350M academic documents?

William Gunn

Mendeley Talk

Science

What can we learn from topic
modeling on 350M documents?
William Gunn
Head of Academic Outreach
Mendeley
@mrgunn – https://orcid.org/0000-0002-3555-2054

Based in London, Mendeley is
researchers, graduates and software
developers from...

The opposite problem
• We have the papers (400M) and are
looking for the best way to turn
them into structured knowledge.
• We have useful triage indicators -
#altmetrics, reproducibility
• You have great use cases

Mendeley
extracts
research data…
...and aggregates
data in the cloud
Collecting rich signals
from domain experts.

TEAM Project
academic knowledge management solutions
• Algorithms to determine the content similarity of academic papers
• Performing text disambiguation and entity recognition to
differentiate between and relate similar in-text entities and authors
of research papers.
• Developing semantic technologies and semantic web languages with
the focus of metadata integration/validation
• Investigating profiling and user analysis technologies, e.g. based on
search logs and document interaction.
• We will also improve folksonomies and, through that, ontologies of
text.
• Finally, tagging behaviour will be analysed to improve tag
recommendations and strategies.
• http://team-project.tugraz.at/blog/
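
The first bullet, content similarity of academic papers, can be illustrated with a minimal sketch. The slides do not say which algorithm the TEAM Project uses; TF-IDF weighting with cosine similarity is assumed here only because it is a common baseline. The abstracts and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF keeps weights positive even for terms found everywhere.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical abstracts, invented for illustration only.
abstracts = [
    "Enzymatic catalysis of the reaction in yeast cells",
    "Catalysis and reaction rates of enzymatic processes in cells",
    "Tag recommendation from search logs and user profiles",
]
vecs = tfidf_vectors(abstracts)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
print(f"related: {sim_related:.2f}, unrelated: {sim_unrelated:.2f}")
```

The two biochemistry abstracts share several terms and score above zero, while the third shares no words with the first and scores zero. That purely lexical behaviour is the limitation the next slide addresses.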

Semantics vs. Syntax
• Language expresses semantics via syntax
• Syntax is all a computer sees in a research
article.
• How do we get to semantics?
• Topic Modeling!
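
To make the idea concrete, here is a toy sketch of one topic model, latent Dirichlet allocation fitted with collapsed Gibbs sampling. The slides do not state which model or implementation Mendeley used, so the algorithm, the hyperparameters (`alpha`, `beta`), and the four tiny documents are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a tiny LDA model with collapsed Gibbs sampling.

    docs is a list of token lists. Returns (doc_topic, topic_word) counts.
    """
    rng = random.Random(seed)
    n_vocab = len({w for doc in docs for w in doc})

    doc_topic = [[0] * n_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(n_topics)]  # times word w is in topic k
    topic_total = [0] * n_topics                       # tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the collapsed conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * n_vocab)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Hypothetical documents: two biology-like, two computing-like.
docs = [
    "gene protein cell gene enzyme".split(),
    "protein enzyme cell protein gene".split(),
    "network software algorithm network code".split(),
    "algorithm code software algorithm network".split(),
]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
for k, counts in enumerate(topic_word):
    top = [w for w, c in counts.most_common(3) if c > 0]
    print(f"topic {k}: {top}")
```

The model only ever sees word counts (syntax), yet words that co-occur tend to end up in the same topic, which is one route from syntax toward semantics. A corpus of hundreds of millions of documents would need a distributed or online implementation rather than this loop.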

Distribution of Topics
[Bar chart; vertical axis 0% to 35%; categories: Bio, Phys, Engineer, Comp Sci, Psych & Edu, Business, Law, Other]

Subcategories of Comp. Sci.
[Bar chart; vertical axis 0% to 20%; categories: AI, HCI, Info Sci, Software Eng, Networks]

Categorization As A Process
Thing
Process
Reaction
Catalysis
Enzymatic

Code Project
Use case = mining research papers for facts
to add to LOD repositories and light-weight
ontologies.
• Crowd-sourcing enabled semantic enrichment & integration
techniques for integrating facts contained in unstructured
information into the LOD cloud
• Federated, provenance-enabled querying methods for fact
discovery in LOD repositories
• Web-based visual analysis interfaces to support human based
analysis, integration and organisation of facts
• Socio-economic factors – roles, revenue-models and value
chains – realisable in the envisioned ecosystem.
• http://code-research.eu/

We didn’t
see that a target is
more likely to be validated if it
was reported in ten publications
or in two publications
NATURE REVIEWS DRUG DISCOVERY 10, 712 (SEPTEMBER 2011)

Either the results were reproducible
and showed transferability in other
models, or even a 1:1 reproduction of
published experimental procedures
revealed inconsistencies between
published and in-house data
NATURE REVIEWS DRUG DISCOVERY 10, 712 (SEPTEMBER 2011)

There is no Gold Standard
• Amgen: 47 of 53 “landmark” oncology publications could
not be reproduced.
• Bayer: 43 of 67 oncology & cardiovascular projects were
based on contradictory results
• Dr. John Ioannidis: 432 publications purporting sex
differences in hypertension, multiple sclerosis, or lung
cancer. Only one data set was reproducible.
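
As a quick worked check, the Amgen and Bayer counts above can be turned into rates. The snippet uses only the numbers quoted on the slide; the variable names are illustrative. The Ioannidis figure is left out because the slide mixes publications and data sets, so a single ratio would be an assumption.

```python
# Counts as quoted on the slide: (label, affected studies, studies examined).
reports = [
    ("Amgen: could not be reproduced", 47, 53),
    ("Bayer: based on contradictory results", 43, 67),
]

rates = {}
for label, affected, examined in reports:
    rates[label] = affected / examined
    print(f"{label}: {affected}/{examined} = {rates[label]:.1%}")
```

That is roughly 89% for Amgen and 64% for Bayer.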

Building a reproducibility dataset
• Mendeley and Science Exchange have
started the Reproducibility Initiative
• working with Figshare & PLOS to host data
& replication reports
• building open datasets backing high-impact
work
• extending the “executable paper” concept
to biomedical research

Make it porous & part of the
web.
• Our success as a crowdsourcing platform
is largely due to our openness & end-user
usefulness.
• Communities must be open if they are to
thrive.

www.mendeley.com
william.gunn@mendeley.com
@mrgunn

1. What can we learn from topic modeling on 350M documents? William Gunn Head of Academic Outreach Mendeley @mrgunn – https://orcid.org/0000-0002-3555-2054

2. Based in London, Mendeley is researchers, graduates and software developers from...

3. The opposite problem • We have the papers (400M) and are looking for the best way to turn them into structured knowledge. • We have useful triage indicators - #altmetrics, reproducibility • You have great use cases

4. Mendeley extracts research data… ...and aggregates data in the cloud. Collecting rich signals from domain experts.

5. Rich user profile data

6. TEAM Project academic knowledge management solutions • Algorithms to determine the content similarity of academic papers • Performing text disambiguation and entity recognition to differentiate between and relate similar in-text entities and authors of research papers. • Developing semantic technologies and semantic web languages with the focus of metadata integration/validation • Investigate profiling and user analysis technologies, e.g. based on search logs and document interaction. • We will also improve folksonomies and through that, ontologies of text. • Finally, tagging behaviour will be analysed to improve tag recommendations and strategies. • http://team-project.tugraz.at/blog/

7. Semantics vs. Syntax • Language expresses semantics via syntax • Syntax is all a computer sees in a research article. • How do we get to semantics? • Topic Modeling!

8. Distribution of Topics [Bar chart; vertical axis 0% to 35%; categories: Bio, Phys, Engineer, Comp Sci, Psych & Edu, Business, Law, Other]

9. Subcategories of Comp. Sci. [Bar chart; vertical axis 0% to 20%; categories: AI, HCI, Info Sci, Software Eng, Networks]

10.

11. Generated topics – Comp. Sci.

12. Generated Topics - Biology

13. Categorization is imperfect

14. Categorization As A Process Thing Process Reaction Catalysis Enzymatic

15. Categorization As A Process Thing Process Reaction Catalysis Enzymatic

16. Categories change over time

17. Can we assist triage?

18. Code Project Use case = mining research papers for facts to add to LOD repositories and light-weight ontologies. • Crowd-sourcing enabled semantic enrichment & integration techniques for integrating facts contained in unstructured information into the LOD cloud • Federated, provenance-enabled querying methods for fact discovery in LOD repositories • Web-based visual analysis interfaces to support human based analysis, integration and organisation of facts • Socio-economic factors – roles, revenue-models and value chains – realisable in the envisioned ecosystem. • http://code-research.eu/

19.

20.

21.

22. Metrics as a discovery tool

23. We didn’t see that a target is more likely to be validated if it was reported in ten publications or in two publications NATURE REVIEWS DRUG DISCOVERY 10, 712 (SEPTEMBER 2011)

24. Either the results were reproducible and showed transferability in other models, or even a 1:1 reproduction of published experimental procedures revealed inconsistencies between published and in-house data NATURE REVIEWS DRUG DISCOVERY 10, 712 (SEPTEMBER 2011)

25. There is no Gold Standard • Amgen: 47 of 53 “landmark” oncology publications could not be reproduced. • Bayer: 43 of 67 oncology & cardiovascular projects were based on contradictory results • Dr. John Ioannidis: 432 publications purporting sex differences in hypertension, multiple sclerosis, or lung cancer. Only one data set was reproducible.

26. Building a reproducibility dataset • Mendeley and Science Exchange have started the Reproducibility Initiative • working with Figshare & PLOS to host data & replication reports • building open datasets backing high-impact work • extending the “executable paper” concept to biomedical research

27. Make it porous & part of the web. • Our success as a crowdsourcing platform is largely due to our openness & end-user usefulness. • Communities must be open if they are to thrive.

28. www.mendeley.com william.gunn@mendeley.com @mrgunn
