Paper Abstracts Matter... But How much?

One-week project, out of curiosity: this presentation analyzes more than 300,000 abstracts from PubMed to identify common themes and trends in biotech research. The bulk of the analysis was performed using natural language processing (NLP) and machine learning (ML) on the titles and abstract contents.

I found that a paper's abstract alone is highly predictive of its future impact (measured by citation count).
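To make the kind of text feature involved more concrete, here is a minimal sketch of Tf-idf weighting, one of the NLP metrics named in the slides. It is an illustration only, not the author's code: the `tfidf` function, the whitespace tokenizer, the unsmoothed `tf * log(N / df)` formula, and the three toy abstracts are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (plain tf * idf, no smoothing)."""
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy abstracts, invented for illustration.
abstracts = [
    "genome sequencing method for human cells",
    "gene expression in mouse cells",
    "novel sequencing method and data",
]
scores = tfidf(abstracts)
```

In this toy corpus, "genome" occurs in one abstract and "cells" in two, so "genome" receives the higher weight in the first abstract. A term's weight shrinks as it becomes common across the corpus.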


  1. Fletcher Series. 2016 Aug 26;1(1-10). Abstracts Matter. But... How much so? Rascon CA1 1cynthia.alexander@gmail.com, San Francisco CA, 94105, USA. Abstract: The number of times a scientific paper is cited (citation count) has emerged as a proxy for a paper’s success within its field. Here, I aim to address how relevant an abstract is to a scientific publication, and furthermore which features of such abstracts have the largest impact on a paper’s success (as estimated by citation count). The data set comprised all abstracts of scientific papers from 22 top biotech journals published in the period 1995-2016, a total of 310,175 papers. Journal names and the affiliations of the heads of laboratories were not incorporated in this model, which aimed to be based solely on each abstract’s title and content. Data cleaning and feature engineering, relying largely on NLP metrics (LSA, Tf-idf, POS tagger), gave good insight into what best predicts citation count across the
  2. Biotech papers have a steady trending curve. Figure 1. Number of citations per paper by year of publication. After cleaning, the corpus comprises 202,173 abstracts. Each cyan dot represents a single paper (transparency 0.3).
  3. A journal’s prestige depends on its impact factor. Figure 2. Journals used for the data set and the number of citations per paper published between 1995 and 2010, shown as a violin plot. These differences reflect, to some extent, each journal’s impact factor (the yearly average number of citations).
  4. Figure 3. Final set of 134,374 papers (1995-2010). The total number of citations per paper (target, y) was binned into two classes: under 10, or 10 and over, total citations since the paper’s publication date (0 or 1, respectively). Left side: example of an abstract and its citation count. Class 0 covers 1-9 citations (25% of papers); class 1 covers 10 or more (75%).
  5. LSA, Tf-idf, and POS tagging selected as star features, with Random Forests as the model of choice. Figure 4. ROC and Precision/Recall curves for the top performing models.
  6. Model trained on the previous 5 years (2005-2009) to predict the ‘success’ of 2010 papers. Figure 5. ROC and Precision/Recall curves for the top performing models, this time trained on 2005-2009 papers to predict the ‘success’ of 2010 papers.
  7. Features identified as important by RF for predicting papers’ success in coming years. Figure 6. Feature importances as ranked by Random Forests, for a model trained on 2005-2009 and tested on 2010 papers.
     Legend: * Abstract LSA (100 comp.); ** Abstract LSA on Tf-idf (100 comp.); *** Title LSA.
     Feature labels, in the order they appear in the figure: C2 **, C2 *, C4 *, C7 **, C4 **, POS tag ‘:’, C8 **, C5 **, Abstract length, C3 **, C1 *, C31 ***, C15 **, C15 *, C14 *, C16 **, C3 *, C6 *, POS tag ‘.’, C29 **.
     Top five, each with its interpretive label and highest-weighted terms:
     1st, Next Generation Sequencing: sequenc 0.20, method 0.17, data 0.16, genom 0.16, avail 0.14
     2nd, Cellular regulation / gene expression: cell 0.71, activ 0.19, induc 0.08, regul 0.08, mice 0.07
     3rd, Cellular models (methods): cell 0.28, use 0.23, data 0.19, method 0.17, model 0.16
     4th, Applied genomics (mutants): genom 0.25, sequenc 0.25, protein 0.19, mutant 0.12, human 0.11
     5th, Basic research (DNA related): gene 0.28, dna 0.27, rna 0.20, transcript 0.20, genom 0.17
  8. Abstracts matter about: 81%. Need to consider: Are better scientists simply better communicators? Or are great scientists also really good at communicating? I did not incorporate a feature to account for novelty (quite the opposite). It is circular to say that the more papers exist in a field, the more likely a paper in it is to be cited in the future. However, this suggests that trends exist in academia. *duh*
  9. Abstracts matter about: 81%. Future directions: Multi-class case. Extend the prediction forecast window (2017?). Examine the abstracts on which the model did poorly. Flask app to ‘score’ new abstracts. Time series: model topic trends over time. Is it too early or too late for a paper to come out?
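The evaluation setup described in the slides (binning citation counts at 10, training on 2005-2009, testing on 2010, and reporting ROC curves) can be sketched roughly as follows. This is a hedged illustration, not the author's code: the function names, the dictionary layout, and the toy papers are hypothetical, and the `score` values are invented stand-ins for the probabilities a trained classifier would output.

```python
def bin_citations(count, threshold=10):
    """Class 1 for `threshold` or more total citations, class 0 otherwise."""
    return 1 if count >= threshold else 0


def temporal_split(papers, train_years, test_year):
    """Train on earlier publication years, test on a later one."""
    train = [p for p in papers if p["year"] in train_years]
    test = [p for p in papers if p["year"] == test_year]
    return train, test


def roc_auc(labels, scores):
    """Area under the ROC curve: the probability that a random positive
    outscores a random negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy records, invented for illustration; "score" is a placeholder
# for a model's predicted probability of class 1.
papers = [
    {"year": 2005, "citations": 3, "score": 0.2},
    {"year": 2007, "citations": 40, "score": 0.9},
    {"year": 2009, "citations": 12, "score": 0.6},
    {"year": 2010, "citations": 25, "score": 0.8},
    {"year": 2010, "citations": 4, "score": 0.3},
    {"year": 2010, "citations": 15, "score": 0.7},
    {"year": 2010, "citations": 2, "score": 0.75},
]

train, test = temporal_split(papers, range(2005, 2010), 2010)
labels = [bin_citations(p["citations"]) for p in test]
scores = [p["score"] for p in test]
auc = roc_auc(labels, scores)
```

Splitting by publication year rather than at random keeps later papers out of the training set, which matches the forecasting question posed in the slides.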
