Publish or Perish:
Towards a Ranking of Scientists using
    Bibliographic Data Mining
                 Lior Rokach
Department of Information Systems Engineering
     Ben-Gurion University of the Negev
About Me

Prof. Lior Rokach
Department of Information Systems Engineering
Faculty of Engineering Sciences
Head of the Machine Learning Lab
Ben-Gurion University of the Negev

Email: liorrk@bgu.ac.il
http://www.ise.bgu.ac.il/faculty/liorr/

PhD (2004) from Tel Aviv University
Outline:
•   What is bibliometrics?
•   Short tutorial on bibliometric measures
•   Our methodology: data mining
•   Task 1: Academic positions
•   Task 2: AAAI Fellowship
•   Results
•   Conclusions
Ranking scientists, WHY?
•   Promotion
•   Tenure
•   Grants
•   Prizes
Bibliometrics

• “Man is an animal that writes letters”

      – Attributed to Lewis Carroll (Charles Dodgson)

• Scientist is an animal that writes papers

• Bibliometrics is measurement of (scientific) publications

• The simplest measure – Number of publications -
  Disadvantage: counts Quantity and disregards Quality
Publish or Perish




“I don't mind your thinking slowly. I mind your
  publishing faster than you can think.”
(Nobel laureate physicist Wolfgang Pauli)
Metrics: Do metrics matter?
• According to Abbott et al.
  (Nature, 2010):
  – Department heads say “No”
     • “External letters trump everything”
  – But …
     • They admit that “those ‘qualitative’ letters
       of recommendation sometimes bring in
       quantitative metrics by the back door”
     • Most of the researchers (70%) believe
       metrics have an effect
Quick Guide To Bibliometrics
         Measures
Citation Index
A citation index is an index of citations
 between publications, allowing the user to
 easily establish which later documents cite
 which earlier documents
The First Citation Index
                                                                 Cited by




The first citation index is attributed to the Hebrew Talmud (see above),
  dated th century (Weinberg, 1997), while others refer to Shepard's
  Citations, created in 1873, as the first citation index.
Simple Citations-Based Measures
       to Evaluate Scientists
• Total Citations (and its square root)

• Total Citations normalized by number of authors

• Mean number of citations per year

• Mean number of citations per paper
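
The four measures listed above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the input format (a list of `(citations, n_authors)` pairs), the function name, and the reading of "normalized by number of authors" as dividing each paper's citations by its own author count are all assumptions.

```python
import math

def simple_citation_measures(papers, career_years):
    """papers: list of (citations, n_authors); career_years: years of activity (assumed > 0)."""
    total = sum(c for c, _ in papers)
    return {
        "total_citations": total,
        "sqrt_total_citations": math.sqrt(total),
        # Assumption: each paper's citations are split evenly among its authors.
        "citations_per_author": sum(c / a for c, a in papers),
        "citations_per_year": total / career_years,
        "citations_per_paper": total / len(papers),
    }

measures = simple_citation_measures([(10, 2), (6, 3), (0, 1)], career_years=4)
```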
Why citations are not always an ideal way
 to evaluate researchers' publications
• Uncitedness: It is a sobering fact that some 90% of articles that have
  been published in academic journals are never cited. Even Nobel
  laureates have a rather large fraction (10% or more) of uncited
  publications (Egghe et al., 2011).
• But the terms “uncited” and “seldom cited” usually refer
  to papers that are uncited or seldom cited in the journals monitored by Thomson
  Reuters and other similar databases, not in all journals, books, and
  reports.
• “Uncited” or “seldom cited” is not a synonym for “not used”
  (MacRoberts and MacRoberts, 2011).
• Expert judgment is the best, and in the last resort the only, criterion of
  performance.
A Brief History of Citation Analysis
• 1955:
   – Eugene Garfield – Linguist
   – Developed the impact factor.
   – Founder of the Institute for Scientific Information (ISI)
• 1997 (CiteSeer):
   – Lee Giles; Kurt D. Bollacker; Steve Lawrence
   – Crawls and harvests papers on the web
   – Focuses mainly on CS
• 2004 (Google Scholar):
   – “Stand on the shoulders of giants”
   – Freely accessible web search engine for scholarly literature
• 2005:
   – Jorge E. Hirsch – Physicist
   – Developed the h-index
• 2007:
   – Carl Bergstrom – Biologist
   – Established http://eigenfactor.org/
   – Uses the PageRank algorithm to rank journals
1. Impact Factor (Garfield, 1955)
• Citation Indexes for Science: A New Dimension in
  Documentation through Association of Ideas
   – Garfield, E., Science, 1955, 122, 108-111

• The impact factor of a journal, as used by Thomson
  Scientific, is the average number of citations received in a given
  year by the papers the journal published during the two preceding years.

$$\text{2007 impact factor of journal ABC} = \frac{\text{number of times articles published in ABC during 2005-2006 were cited in indexed journals during 2007}}{\text{number of ``citable'' articles published by ABC in 2005 and 2006}}$$
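
As a worked illustration of the ratio above, the sketch below computes a two-year impact factor from per-publication-year counts. The function name, the dictionary inputs and the numbers are hypothetical; only the ratio itself comes from the slide.

```python
def impact_factor(citations_by_pub_year, citable_by_pub_year, year):
    """Two-year impact factor for `year`.

    citations_by_pub_year: citations received during `year`, keyed by the
        publication year of the cited article.
    citable_by_pub_year: number of citable articles, keyed by publication year.
    """
    window = (year - 2, year - 1)
    cites = sum(citations_by_pub_year.get(y, 0) for y in window)
    items = sum(citable_by_pub_year.get(y, 0) for y in window)
    if items == 0:
        raise ValueError("no citable articles in the two-year window")
    return cites / items

# Hypothetical journal ABC: citations to 2004 articles fall outside the window.
if_2007 = impact_factor({2004: 500, 2005: 120, 2006: 80}, {2005: 50, 2006: 50}, 2007)
```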
Criticisms of the Impact Factor
• Subject variation: citation studies should be normalized to
  take into account variables such as field, discipline etc.
• Long Tail: the citation count of an individual paper is largely uncorrelated with the
  impact factor of the journal in which it was published.
• Limited subset of journals are indexed
• Biased toward English-language journals
• Short (two year) snapshot of journal
• Includes self-citations
• Some journals are unfairly promoting their own papers
• Journal Inclusion Criteria are more than just quality
Variations of Impact Factor and more:
• Five years Impact Factor
• Cited Half-Life - measures archivability. The Cited Half-Life of journal J in year
   X is the number of years after which 50% of the lifetime citations of J's content published in
   X have been received.
• Ranking - Journals are often ranked by impact factor within an appropriate Thomson Reuters
   subject category. A journal can belong to multiple subject categories, so its rank
   differs between them and should always be quoted together with the subject
   category used.


Other Journal Ranking:
• Eigenfactor - uses an algorithm similar to Google's PageRank
    – By this approach, journals are considered to be influential if they are cited often by other
      influential journals.
    – Removes self-citations
    – Looks at five years of data
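
The idea that a journal is influential when influential journals cite it can be sketched with a plain PageRank power iteration. This is not the actual Eigenfactor algorithm, whose details are not given in the slides; the damping factor, the iteration count, the handling of journals without outgoing citations and the toy data are all assumptions.

```python
def journal_pagerank(citations, damping=0.85, iterations=100):
    """citations[a][b] = number of citations from journal a to journal b."""
    journals = sorted(set(citations) | {b for row in citations.values() for b in row})
    n = len(journals)
    # Drop self-citations, as the Eigenfactor approach does.
    out = {a: {b: c for b, c in citations.get(a, {}).items() if b != a and c > 0}
           for a in journals}
    score = {j: 1.0 / n for j in journals}
    for _ in range(iterations):
        new = {j: (1.0 - damping) / n for j in journals}
        for a in journals:
            total = sum(out[a].values())
            if total == 0:
                # Journal citing nobody: spread its score evenly (assumption).
                for j in journals:
                    new[j] += damping * score[a] / n
            else:
                for b, c in out[a].items():
                    new[b] += damping * score[a] * c / total
        score = new
    return score

scores = journal_pagerank({"A": {"A": 50, "B": 10}, "B": {"C": 5}, "C": {"B": 5}})
```

In this toy graph journal A cites itself heavily, but those self-citations are ignored, so A ends up with the lowest score.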
2. H-Index
    (Hirsch, 2005; Egghe and Rousseau, 2006)

• A scientist is said to have Hirsch index h if h of their
  total, N, papers have at least h citations each
• Typical h-index values for physicists, according to Hirsch:
  – 10-12: tenure decisions
  – 18: a full professorship
  – 15–20: a fellowship in the American
    Physical Society
  – 45 or higher: membership in the United
    States National Academy of Sciences.
• H-Index in IS (Clarke, 2008)
  – Using Google Scholar
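
The h-index definition above translates directly into code. A minimal sketch, assuming the input is simply a list of per-paper citation counts (the function name is illustrative):

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

example_h = h_index([10, 8, 5, 4, 3])
```

For the example list, four papers have at least 4 citations each, but there are not five papers with at least 5, so the h-index is 4.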
h ~ mn
    (m = gradient, n = number of years)
1. m ~ 1, h=20 after 20 years: “successful scientists”
2. m ~ 2, h=40 after 20 years: “outstanding scientists”
3. m ~ 3, h=60 (20 years) or h=90 (30 years): “truly unique
individuals”

Physics Nobel prizes (last 20 years)
      h (median) = 35
      84% had h ≥ 30
      49% had m < 1
Modified H-Index Metrics
           Scientists with the same H-Index
• Rational H-Index Distance (Ruane and Tol, 2008): It first calculates how many new
  citations are needed to increase the h-index by one point. Let m denote the additional
  citations needed. Thus the rational h-index is hD = h + 1 - m/(2h+1).
• Rational H-Index X (Ruane and Tol, 2008): A researcher has an h-index of h if h is the
  largest number of papers with at least h citations. However, some researchers may have
  more than h papers, say n, with at least h citations. Let us define x = n-h. Thus the
  rational H-Index becomes hX = h + x/(s-h), where s is the total number of publications.
• e-index (Chun-Ting Zhang, 2009): The square root of the surplus of citations in the
  h-set beyond h^2, i.e., beyond the theoretical minimum required to obtain an h-index
  of h. The aim of the e-index is to differentiate between scientists with similar
  h-indices but different citation patterns.
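
A short sketch of two of the measures above, the rational h-index (distance variant) and the e-index. It assumes the distance formula reads h + 1 - m/(2h+1), with m counted over the top h+1 papers, and that a missing (h+1)-th paper counts as a paper with zero citations; function names are illustrative.

```python
import math

def h_index(citations):
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

def rational_h_distance(citations):
    """h + 1 - m/(2h+1), where m = citations still needed to reach h+1."""
    ranked = sorted(citations, reverse=True)
    h = h_index(ranked)
    target = h + 1
    top = ranked[:target]
    top += [0] * (target - len(top))  # assumption: pad with an uncited paper
    m = sum(max(0, target - c) for c in top)
    return h + 1 - m / (2 * h + 1)

def e_index(citations):
    """Square root of the citations in the h-core beyond h^2."""
    ranked = sorted(citations, reverse=True)
    h = h_index(ranked)
    return math.sqrt(sum(ranked[:h]) - h * h)

cites = [10, 8, 5, 4, 3]          # h = 4
h_d = rational_h_distance(cites)  # 3 more citations needed to reach h = 5
e = e_index(cites)                # h-core holds 27 citations, 11 beyond 16
```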
Modified H-Index Metrics
             To share the fame in a fair way
              multi-authored manuscripts
• Individual h-index (Batista et al., 2006): It divides the standard h-index by the
  average number of authors in the articles that contribute to the h-index, in order to
  reduce the effects of co-authorship.
• Norm Individual h-index: It first normalizes the number of citations for each paper by
  dividing the number of citations by the number of authors for that paper, then
  calculates hI,norm as the h-index of the normalized citation counts. This approach is
  much more fine-grained than Batista et al.'s; it more accurately accounts for any
  co-authorship effects that might be present and is a better approximation of the
  per-author impact, which is what the original h-index set out to provide.
• Schreiber Individual h-index (Schreiber, 2008): Schreiber's method uses fractional paper
  counts (for example, a paper with three authors counts only as one third) instead of
  reduced citation counts to account for shared authorship of papers, and then determines
  the multi-authored hm index based on the resulting effective rank of the papers using
  undiluted citation counts.
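
The three co-authorship corrections above can be compared on the same toy data. This is a sketch under assumptions: papers are `(citations, n_authors)` pairs, and the Schreiber variant is read as "the largest effective rank that does not exceed the citation count of the paper at that rank"; the function names are illustrative.

```python
def h_index(values):
    ranked = sorted(values, reverse=True)
    return sum(1 for rank, v in enumerate(ranked, start=1) if v >= rank)

def individual_h_batista(papers):
    """Standard h divided by the mean number of authors in the h-core."""
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    h = h_index([c for c, _ in ranked])
    if h == 0:
        return 0.0
    mean_authors = sum(a for _, a in ranked[:h]) / h
    return h / mean_authors

def individual_h_normalized(papers):
    """h-index of the per-author citation counts."""
    return h_index([c / a for c, a in papers])

def schreiber_hm(papers):
    """Each paper adds 1/n_authors to the effective rank; citations stay undiluted."""
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    effective_rank, hm = 0.0, 0.0
    for citations, authors in ranked:
        effective_rank += 1.0 / authors
        if citations < effective_rank:
            break
        hm = effective_rank
    return hm

papers = [(10, 2), (8, 1), (5, 5), (4, 2), (3, 1)]  # (citations, authors)
```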
Modified H-Index Metrics
                        Age Adjusted
• Contemporary h-index (Sidiropoulos et al., 2006): It adds an age-related weighting to
  each cited article, giving less weight to older articles. The weighting is
  parametrized; with gamma=4 and delta=1, the citations of an article published during
  the current year count four times, the citations of an article published 4 years ago
  count only once, the citations of an article published 6 years ago count 4/6 times,
  and so on.
• AR-index (Jin, 2007): It is based on an age-weighted citation rate, where the number of
  citations to a given paper is divided by the age of that paper. Jin defines the
  AR-index as the square root of the sum of all age-weighted citation counts over all
  papers that contribute to the h-index.
• AWCR: Like the AR-index but summed over all papers instead. In particular, it allows
  younger and as yet less cited papers to contribute to the AWCR, even though they may
  not yet contribute to the h-index.
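
The age-adjusted measures above can be sketched as follows. Several conventions are assumptions rather than taken from the slides: a paper's age is counted as `current_year - publication_year + 1` (so a paper from the current year has age 1), and AWCR is taken as the plain sum without a square root, which is the convention I believe the Publish or Perish software uses.

```python
import math

def h_index(values):
    ranked = sorted(values, reverse=True)
    return sum(1 for rank, v in enumerate(ranked, start=1) if v >= rank)

def age(pub_year, current_year):
    # Assumption: the publication year itself counts as year 1.
    return current_year - pub_year + 1

def contemporary_h_index(papers, current_year, gamma=4.0, delta=1.0):
    """h-index of the age-weighted scores gamma * citations / age**delta."""
    scores = [gamma * c * age(y, current_year) ** (-delta) for c, y in papers]
    return h_index(scores)

def awcr(papers, current_year):
    """Age-weighted citation rate summed over all papers."""
    return sum(c / age(y, current_year) for c, y in papers)

def ar_index(papers, current_year):
    """Square root of the age-weighted citations of the h-core papers only."""
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    h = h_index([c for c, _ in ranked])
    return math.sqrt(awcr(ranked[:h], current_year))

# (citations, publication year), evaluated in 2010
papers = [(40, 2010), (20, 2007), (12, 2005), (6, 2000), (1, 2009)]
```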
Revised H-Index Metrics
                         Others
• AWCRpA: The per-author age-weighted citation rate is similar to the plain AWCR, but is
  normalized to the number of authors for each paper.
• g-Index (Leo Egghe, 2006): Given a set of articles ranked in decreasing order of the
  number of citations that they received, the g-index is the (unique) largest number
  such that the top g articles received (together) at least g^2 citations. It aims to
  improve on the h-index by giving more weight to highly-cited articles.
• Pi-index (Vinkler, 2009): The pi-index is equal to one hundredth of the number of
  citations obtained by the top square root of the total number of journal papers
  (“elite set of papers”) ranked by decreasing number of citations.
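
A minimal sketch of the g-index and pi-index as described above. Two choices are assumptions: the g-index is capped at the number of papers, and the size of the elite set is the square root of the paper count rounded down.

```python
import math

def g_index(citations):
    """Largest g such that the top g papers together have at least g^2 citations."""
    ranked = sorted(citations, reverse=True)
    g, running_total = 0, 0
    for rank, count in enumerate(ranked, start=1):
        running_total += count
        if running_total >= rank * rank:
            g = rank
    return g

def pi_index(citations):
    """One hundredth of the citations received by the top sqrt(P) papers."""
    ranked = sorted(citations, reverse=True)
    elite_size = math.isqrt(len(ranked))  # assumption: round the square root down
    return sum(ranked[:elite_size]) / 100

g = g_index([10, 8, 5, 4, 3])
pi = pi_index([100, 50, 30, 10, 5, 3, 2, 1, 0])  # elite set = top 3 of 9 papers
```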
Limitations of H-Index
• The h-index ignores the importance of the publications
   – Évariste Galois' h-index is 2, and will remain so forever.


   – Had Albert Einstein died in early 1906, his h-index would be
     stuck at 4 or 5, despite his high reputation at that date.
• Ignores the context of citations:
   – Some papers are cited to flesh out the introduction (related
     work)
   – Some citations are made in a negative context
• Gratuitous authorship
Education Subject Category…
Eigenfactor.org Scores
• Eigenfactor score: …the higher the better
   – A measure of the overall value provided by all of the articles published
     in a given journal in a year; accounts for difference in prestige among
     citing journals. A measure of the journal's total importance to the
     scientific community.
   – Eigenfactor scores are scaled so that the sum of the Eigenfactor scores
     of all journals listed in Thomson's Journal Citation Reports (JCR) is
     100.
• Article Influence score: … the higher the better
   – Article Influence measures the average influence, per article, of the
     papers in a journal. As such, it is comparable to the Impact Factor.
   – Article Influence scores are normalized so that the mean article in the
     entire Thomson Journal Citation Reports (JCR) database has an article
     influence of 1.00.
   – Still, it's best to “compare” within subjects.
• Cost effectiveness: … the lower the better
   – price / eigenfactor [2006 data]
Other Journal Ranking Efforts…
SCImago Journal Rank (SJR)
  Similar to eigenfactor methods, but based on
    citations in Scopus
  – Freely available at scimagojr.com
  – More journals (~13,500)
  – More international diversity
  – Uses PageRank algorithm (like eigenfactor.org)
  – 3 years of citations; no self-citations
  – But: Scopus only has citations back to ~1995
SCImago
SCImago Journal Indicator Search…
SCImago Journal Search (Agronomy
Journal)
A Few Other Journal Ranking
      Proposals… many would like to use
                 journal usage stats
• Usage Factors – Based on journal usage
  (COUNTER stats [Counting Online Usage of
  Networked Electronic Resources]) uksg.org/usagefactors/final
• Y factor, a combination of both the impact
  factor and the weighted page rank developed
  by Google (Bollen et al., 2006)
• MESUR: MEtrics from Scholarly Usage of
  Resources – Uses citations & COUNTER
  stats
  http://www.mesur.org/MESUR.html
Other Measures for Evaluating
        Researchers (Tang, et al. 2008)
• Uptrend - Nothing can catch people's eyes more than a rising star.
  Uptrend measures are used to define the rising degree of a researcher.
• We use the information on each author's papers, including the publication date
  and the conference's impact factor, and apply the least squares method to fit a
  curve to the papers published in the most recent N years. We then use the curve
  to predict the researcher's score in the next year, which is defined as the
  Uptrend score.
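
The exact Uptrend formula is not reproduced in this transcript, so the following is only a sketch of the general idea: fit a least-squares trend to a researcher's recent yearly scores and extrapolate one year ahead. It assumes a straight line as the fitted curve and a yearly score equal to the summed venue impact factors of that year's papers; both choices, and the function name, are hypothetical.

```python
def uptrend(yearly_scores, current_year, n_years):
    """Predict next year's score from a least-squares line over the last n_years.

    yearly_scores: {year: score}; years without papers count as 0 (assumption).
    """
    years = list(range(current_year - n_years + 1, current_year + 1))
    scores = [yearly_scores.get(y, 0.0) for y in years]
    mean_x = sum(years) / len(years)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, scores))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return intercept + slope * (current_year + 1)

# A researcher whose yearly score grows by one point per year.
rising = uptrend({2006: 1.0, 2007: 2.0, 2008: 3.0, 2009: 4.0}, 2009, 4)
```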
Other Measures for Evaluating
      Researchers (Tang, et al. 2008)
• Activity - People's activity is simply defined based
  on one's papers published in the last years. We
  consider the importance of each paper and thus
  define the activity score as:
Other Measures for Evaluating
      Researchers (Tang, et al. 2008)
• Diversity - Generally, an expert's research may
  include several different research fields. Diversity is
  defined to quantitatively reflect the degree. In
  particular, we first use the author-conference-topic
  model (Tang, et al. 2008) to obtain the research
  fields for each expert.
Other Measures for Evaluating
      Researchers (Tang, et al. 2008)
• Sociability - The score of sociability is basically
  defined based on how many coauthors an expert
  has. We define the score as :

• where #copaperc denotes the number of papers
  coauthored between the expert and the coauthor c. In
  the next step, we will further consider the location,
  organization, nationality information, and research
  fields.
Richard Van Noorden (2010)
Bibliometrics Predictive Power

• Prediction of Nobel Laureates –
        – Thomson Reuters Citation Laureates rank among the top 0.1% of
          researchers in their fields, based on citations of their
          published papers over the last two decades.
        – Since 2002, of those named Thomson Reuters Citation
          Laureates, 12 have gone on to win Nobel Prizes.


• Jensen et al. (2009) used bibliometric measures to predict
  which of the CNRS researchers will be promoted:
     • h-index leads to 48% of “correct” promoted scientists
     • number of citations gives 46%
     • number of published papers only 42%.
Research Questions

• Primary Questions:
  – To what extent do bibliometrics reflect scientists'
    ranking in CS?
  – Which single measure is the best predictor?
  – How should different measures be combined?
• Secondary Questions:
  – Which type of manuscripts should be taken into
    consideration?
  – Does Self-Citation really matter?
  – Which citation index is better?
Research Methods
• Retrospective analysis of scientists' careers:
   – Correlating academic positions with bibliometrics
     values that evolve as time goes by.
   – AAAI Fellowship
• Using Data Mining Techniques for building:
   – A snapshot classifier for ranking scientists to their
     academic position.
   – A decision making model for promoting scientists.
   – A classifier for deciding who should be awarded the
     AAAI Fellowship each year.
• Comparative analysis
Process
ISI Web of Knowledge
• Coverage
  – Most Journals (13,000 journals)
  – Some Conferences (192,000 conference proceedings)
  – Almost no Books (5,000 books)
  – All patents (23 million patents)
  – 256 subject categories in Science, Social Sciences, and Arts and Humanities,
    covering the full range of scholarship and research
  – Many citations (716 million) Only Citations that are fully match are
• Accuracy
  – Very few errors
  – Very few missing values
  – No Duplications
Google Scholar
• Coverage
  – The largest
  – Still has limited coverage of pre-1990 publications
  – It is criticized for including gray literature in its citation
    counts (Sanderson, 2008)

• Accuracy
  – Missing values
  – Wrong values
  – Duplicate entries
Why CS?
• Variety of sub-fields with different citation patterns
  (Bioinformatics vs AI).

• Different types of important manuscripts (Journal,
  Conferences, Books, Chapters, Patents, etc).

• Evolving field (senior professors completed their PhD in
  other fields).

• We are personally interested in this field
Task 1: Nominating Committee
Inclusion/Exclusion Criteria
47 Researchers
   –   Researchers from Stanford, MIT, Berkeley and Yale
   –   Completed their PhD after 1970
   –   Researcher name can be disambiguated
   –   CV:
        • Promotion years are known
        • No shortcuts in the career.
   – Total of 724 “research years”.
• ISI - Total number of items: 50K (2300 written
  by the targeted researchers).
• Google Scholar - Total number of items: 300K
H-Index Over Time (for 7 professors)
[Figure: ISI h-index (vertical axis, 0 to 18) against years from PhD (horizontal axis, 0 to 28) for BEJERANO, DEVADAS, GIFFORD, GOLDBERG, HUDAK, SUDAN and TENENBAUM]
Citations Over Time (for 7 professors)
[Figure: average ISI total citations (vertical axis, 0 to 1000) against years from PhD (horizontal axis, 0 to 28) for BEJERANO, DEVADAS, GIFFORD, GOLDBERG, HUDAK, SUDAN and TENENBAUM]
Evaluation
• Procedure: Leave One Researcher Out
• Base Classifier – Logistic Regression
      $\ln(\text{odds}) = b + w^T x, \qquad p = \dfrac{1}{1 + e^{-b - w^T x}}$
• Publication type
   – All – All
   – All – Journals
   – Journals - Journals
• Self-Citations:
   – All
   – Self-Citation 1 (the target researcher is not one of the authors)
   – Self-Citations 2 (no overlap between original set of authors
     and the citing paper)
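
The evaluation procedure above can be sketched in plain Python: all yearly records of one researcher are held out together, so the classifier is never tested on someone it has seen during training. The record layout `(researcher, features, label)` and the function names are assumptions; the probability function is the standard logistic model with bias b and weights w.

```python
import math

def logistic_probability(weights, bias, features):
    """p = 1 / (1 + exp(-(b + w.x))) for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def leave_one_researcher_out(records):
    """Yield (held_out, train, test) splits, one per researcher.

    records: list of (researcher, features, label) tuples, typically one
    per researcher and year.
    """
    researchers = sorted({name for name, _, _ in records})
    for held_out in researchers:
        train = [rec for rec in records if rec[0] != held_out]
        test = [rec for rec in records if rec[0] == held_out]
        yield held_out, train, test

# Hypothetical records: (researcher, [h-index, g-index], position)
records = [
    ("r1", [3, 5], "Assistant"), ("r1", [9, 14], "Associate"),
    ("r2", [2, 4], "Post"), ("r3", [15, 22], "Full"),
]
splits = list(leave_one_researcher_out(records))
```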
Task 1.1: Ranking Researchers
• Rank a researcher into one of the following positions,
  given only a snapshot of her bibliometric
  measures:
   –   Post
   –   Assistant
   –   Associate
   –   Full
• Note that we are given neither the scientist's previous
  position nor her seniority.
• Default accuracy = 35%

[Figure: career ladder rising from Post through Assistant and Associate to Full]
The Ranking Task – Results
                      Top 10 Measures
Classification            Cited Manuscript   Citing Manuscript   Self-Citation
Accuracy         Source   Type               Type                Level           Measure
59.95%           ISI      Journal            Journal             1               g-Index
59.30%           ISI      Journal            Journal             0               g-Index
59.30%           ISI      Journal            Journal             2               g-Index
58.65%           ISI      All                Journal             0               Norm h-index
58.65%           ISI      All                Journal             1               Norm h-index
58.65%           ISI      All                Journal             2               Norm h-index
58.00%           ISI      Journal            Journal             1               Norm h-index
57.74%           ISI      Journal            Journal             0               Norm h-index
57.74%           ISI      Journal            Journal             2               Norm h-index
57.48%           Google   Journal            Journal             2               Rational H Index X
The Ranking Task – Results
                   Least Predictive Measures
Classification            Cited Manuscript   Citing Manuscript   Self-Citation
Accuracy         Source   Type               Type                Level           Measure
37.06%           Google   Journal            *                   *               # Publications
37.06%           Google   Journal            *                   *               Individual # Publications
37.19%           Google   Journal            Journal             0               Schreiber h-index
38.10%           Google   All                All                 1               Individual h-index
38.10%           Google   All                All                 2               Individual h-index
38.10%           ISI      All                All                 1               Schreiber h-index
38.23%           ISI      All                All                 0               Schreiber h-index
38.23%           ISI      All                All                 2               Schreiber h-index
38.75%           ISI      All                Journal             0               Schreiber h-index
38.75%           ISI      All                Journal             2               Schreiber h-index
* Statistical significance has been found
Not by bibliometrics alone
                            Accuracy = 73.7% !!!
                                 Predicted
Actual         Full   Associate   Assistant   Post
Post              0        0          56         3
Assistant         0       36         167        15
Associate        29      145          31         1
Full            252       31           3         0

    Years from PhD
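
The reported accuracy can be checked from the confusion matrix: the correctly classified cases are those on the diagonal (actual position equals predicted position). The dictionary layout and function name below are illustrative; the counts are the ones shown on the slide.

```python
# confusion[actual][predicted] = number of researcher-years
confusion = {
    "Post":      {"Full": 0,   "Associate": 0,   "Assistant": 56,  "Post": 3},
    "Assistant": {"Full": 0,   "Associate": 36,  "Assistant": 167, "Post": 15},
    "Associate": {"Full": 29,  "Associate": 145, "Assistant": 31,  "Post": 1},
    "Full":      {"Full": 252, "Associate": 31,  "Assistant": 3,   "Post": 0},
}

def accuracy(matrix):
    """Share of cases whose predicted class equals the actual class."""
    correct = sum(row[actual] for actual, row in matrix.items())
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total

acc = accuracy(confusion)  # 567 correct out of 769 cases
```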
Task 1.2: Promoting Researchers
• Given the researcher's current position and
  her bibliometric measures, decide if she
  should be promoted.

• Measure the absolute deviation in years
  from the actual promotion time.
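
The error measure for this task can be sketched as a mean absolute deviation between predicted and actual promotion times, both expressed in years from PhD. Averaging over cases is an assumption (the slide only says "absolute deviation in years"), and the sample values are hypothetical.

```python
def mean_absolute_deviation(predicted_years, actual_years):
    """Average of |predicted - actual| promotion time, in years."""
    if not predicted_years or len(predicted_years) != len(actual_years):
        raise ValueError("need two non-empty sequences of equal length")
    deviations = [abs(p - a) for p, a in zip(predicted_years, actual_years)]
    return sum(deviations) / len(deviations)

# Three hypothetical promotions: off by 1, 2 and 0 years.
mad = mean_absolute_deviation([6, 13, 7], [7, 11, 7])
```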
Promotion Decision Task - Results
                                                   Self-Citations  Cited Manuscript  Citing Manuscript
Measure                   Calculated as           Source  Level    Type              Type               Assistant   Associate   Full   Average
Rational H-Index 1        Absolute Value          Google  1        All               Journal            1.26        1.58        1.88   1.51
Total Citations           Change from Last Rank   Google  0        Journal           All                1.26        1.68        1.88   1.55
Total Citations           Change from Last Rank   Google  2        Journal           All                1.26        1.68        1.88   1.55
Total Citations           Change from Last Rank   Google  1        Journal           All                1.26        1.71        1.88   1.56
Norm Individual H-Index   Change from Last Rank   Google           All               Journal            1.28        1.74        1.79   1.56
…                         …                       …       …        …                 …                  …           …           …      …
Individual H Index        Change from Last Rank   Google  1        Journal           Journal            1.30        2.03        2.38   1.80
Contemporary H Index      Absolute Value          Google  1        Journal           All                1.46        2.00        2.17   1.81

      * No statistical significance has been found
      * In about 2% of the cases, our system did not recommend promoting a researcher
      although the promotion actually took place.
Not by bibliometrics alone
[Figure: "Improvement vs. Rank" chart; vertical axis from -30.00% to 25.00%, horizontal axis 1. Assistant, 2. Associate, 3. Full]

Promoted to Associate: 6 years from PhD
Promoted to Full: 13 years from PhD

Measure               Assistant   Associate   Full   Average
Rational H-Index 1    1.26        1.58        1.88   1.51
Years from PhD        1.02        1.72        2.38   1.45
Google Scholar vs. ISI Thomson
Self-Citations
Which Manuscripts Should be Taken into
          Consideration?
Which Citing Manuscripts Should be
    Taken into Consideration?
Conclusions – Take 1
• Seniority is a good indicator for
  promoting scientists in leading USA
  universities.
• Variation in bibliometrics among
  scientists contributes only slightly to the
  promotion timing.
• No significant difference between ISI
  and Google
• Self-Citation is not so important
• After all, journals are more reliable
  than other publications.
Task 2: And the AAAI
 Fellowship Goes To
AAAI Fellowship
Try to determine if and when an AI scientist is
  qualified to be elected to the AAAI Fellowship
Data set:
  – 92 researchers that won the award from 1995 to
    2009 only
  – 200 randomly selected AI researchers with at least 5
    papers in top tier AI Journals/Conferences
  – Using ISI data.
     • Google Scholar Coming soon
Task 2.1 – Leave One Scientist Out




                            Criterion                        Average Performance
            Not Identifying a fellow (False Negative)                21%
         Wrongly identifying a non-fellow (False Positive)          8.2%
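The two error rates in the table can be computed with a leave-one-out loop: each scientist is held out in turn, a model is fitted on the rest, and the held-out scientist is classified. The sketch below is only an illustration of that evaluation scheme. The toy threshold learner, the single feature, and the sample values are assumptions made for the example, not the classifier or the data used in the talk.

```python
from statistics import mean

def fit_threshold(xs, ys):
    """Toy stand-in learner: threshold halfway between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y]
    neg = [x for x, y in zip(xs, ys) if not y]
    return (mean(pos) + mean(neg)) / 2

def leave_one_out_rates(xs, ys):
    """Return (false_negative_rate, false_positive_rate) under leave-one-out."""
    fn = fp = 0
    for i in range(len(xs)):
        # Train on everyone except scientist i.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        threshold = fit_threshold(train_x, train_y)
        predicted = xs[i] > threshold
        if ys[i] and not predicted:
            fn += 1  # a fellow was not identified
        elif predicted and not ys[i]:
            fp += 1  # a non-fellow was wrongly identified
    positives = sum(ys)
    negatives = len(ys) - positives
    return fn / positives, fp / negatives

# Hypothetical single-feature data (for example, an h-index per scientist).
scores = [30, 28, 12, 10, 8, 26]
is_fellow = [True, True, True, False, False, False]
fn_rate, fp_rate = leave_one_out_rates(scores, is_fellow)
```

The false negative rate is taken over fellows only and the false positive rate over non-fellows only, which matches how the two criteria are worded in the table.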
Using a single measure

[Figure; labels: Fellows, H-Index]

                       Criterion                        Average Performance
       Not Identifying a fellow (False Negative)                48%
    Wrongly identifying a non-fellow (False Positive)          6.1%
Task 2.2 – Predicting Next Year Fellows
Task 2.2 – Predicting Coming Fellows
Rules Example
• (TC/A = '(65.7085-inf)') and (TP/A = '(26.084-inf)') and (Ih = '(3.565-inf)')
  and (CpY = '(13.191-inf)') => FellowWon=TRUE (49.0/5.0)
• (Pi = '(0.645-inf)') and (AWCR = '(1.0555-3.6035]') and
  (TC/A = '(80.875-inf)') => FellowWon=TRUE (29.0/3.0)
• (TP = '(7.5-inf)') and (e = '(6.595-inf)') and (TP = '(47.5-inf)') and
  (AWCR = '(1.0735-3.849]') and (AWCRpA = '(2.1705-inf)') and (SIh = '(0.5-3.5]')
  => FellowWon=TRUE (18.0/1.0)
• …
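A rule of this form can be read as a conjunction of interval tests on discretized measures. The sketch below shows one way to evaluate such a rule against a scientist's record. It assumes that `(a-b]` means a < x <= b and `(a-inf)` means x > a, and that the trailing pair such as (49.0/5.0) follows the common rule-learner convention of covered instances / misclassified instances. Neither reading is stated on the slide, and the record values are hypothetical.

```python
def in_interval(value, text):
    """Test value against an interval written as '(lo-hi]' or '(lo-inf)'."""
    hi_closed = text.endswith("]")
    # Split on the last '-' so a lower bound of '-inf' is also handled.
    lo_str, _, hi_str = text[1:-1].rpartition("-")
    lo, hi = float(lo_str), float(hi_str)
    upper_ok = value <= hi if hi_closed else value < hi
    return lo < value and upper_ok

def rule_fires(record, conditions):
    """A rule fires only if every (measure, interval) condition holds."""
    return all(in_interval(record[name], interval) for name, interval in conditions)

def rule_precision(covered, misclassified):
    """Share of covered instances the rule labels correctly (assumed reading)."""
    return (covered - misclassified) / covered

# First rule from the slide, written as (measure, interval) pairs.
rule_1 = [
    ("TC/A", "(65.7085-inf)"),
    ("TP/A", "(26.084-inf)"),
    ("Ih", "(3.565-inf)"),
    ("CpY", "(13.191-inf)"),
]

# Hypothetical record for one scientist.
candidate = {"TC/A": 70.0, "TP/A": 30.0, "Ih": 4.0, "CpY": 14.0}
fires = rule_fires(candidate, rule_1)
precision_1 = rule_precision(49.0, 5.0)
```

Under the assumed reading, the first rule would cover 49 training instances and get 44 of them right.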
Task 2.3 – Social Network
• Based on the idea of the Erdős
  number
• Predict fellowship based
  on co-authorship with
  other fellows.
• http://academic.research.microsoft.com/VisualExplorer.aspx#1802181&84132
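An Erdős-style feature can be sketched as the shortest co-authorship distance from a researcher to any existing fellow, together with the number of direct co-authors who are fellows. The code below is a minimal illustration with a hypothetical graph; the function names and the choice of these two features are assumptions, since the slides do not specify how the social-network attributes were derived.

```python
from collections import deque

def distance_to_nearest_fellow(coauthors, fellows):
    """Multi-source BFS: hops from each reachable author to the closest fellow."""
    dist = {f: 0 for f in fellows if f in coauthors}
    queue = deque(dist)
    while queue:
        author = queue.popleft()
        for neighbor in coauthors.get(author, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[author] + 1
                queue.append(neighbor)
    return dist  # authors with no path to a fellow are absent

def fellow_coauthor_count(coauthors, fellows, author):
    """Number of direct co-authors of `author` who are fellows."""
    return sum(1 for c in coauthors.get(author, ()) if c in fellows)

# Hypothetical undirected co-authorship graph as adjacency sets.
graph = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
    "E": set(),
}
known_fellows = {"A"}
distances = distance_to_nearest_fellow(graph, known_fellows)
direct = fellow_coauthor_count(graph, known_fellows, "B")
```

When scoring a candidate who is already a fellow, the candidate would need to be left out of `known_fellows`, otherwise the distance is trivially zero.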
Task 2.3




                   Criterion                        Average Performance
   Not Identifying a fellow (False Negative)                52%
Wrongly identifying a non-fellow (False Positive)          6.6%

                                   +
                   Criterion                        Average Performance
   Not Identifying a fellow (False Negative)                21%
Wrongly identifying a non-fellow (False Positive)          8.2%

                                   =
                   Criterion                        Average Performance
   Not Identifying a fellow (False Negative)                16%
Wrongly identifying a non-fellow (False Positive)          5.9%
Task 2.3
• (Count >= 5) and (CpP >= 7) and (TP/A >= 6.883)
  => Fellow=TRUE (51.0/3.0)
• (TP/A >= 22.944) and (Avg <= 3.266667) and (TP <= 40)
  => Fellow=TRUE (23.0/3.0)
• (Count >= 5) and (e >= 7.071) and (CpP <= 1.618)
  => Fellow=TRUE (11.0/1.0)
• …
Conclusions – Take 2
• Bibliometric measures can be used to
  predict fellowship
• Combining various measures using data
  mining techniques improves predictive power
• Co-authorship relations can slightly boost
  the accuracy
Very Near Future Work
• Adding the Google Scholar dataset
• Examining the contribution of conferences in
  predicting the fellowship.
• Tell Me Who Cites You, …
Why God Never Received
           Tenure at Any University
1) He had only one major publication.
2) It was in Hebrew.
3) It had no references.
4) It wasn't published in a refereed journal.
5) Some even doubt he wrote it himself.
6) It may be true that he created the world, but what has he done since then?
7) His cooperative efforts have been quite limited.
8) The scientific community has had a hard time replicating his results.
9) He never applied to the Ethics Board for permission to use human subjects.
10) When an experiment went awry, he tried to cover it up by drowning the subjects.
11) When subjects didn't behave as predicted, he deleted them from the sample.
12) He rarely came to class, just told students to read the book.
13) Some say he had his son teach the class.
14) He expelled his first two students for learning.
15) Although there were only ten requirements, most students failed his tests.
16) His office hours were infrequent and usually held on a mountaintop.
References
•   Bollen, J., Rodriguez, M. A., Van de Sompel, H. (2006). Journal status. Scientometrics, 69(3), 669-687.
•   Christenson, J. A., Sigelman, L. (1985). Accrediting knowledge: Journal stature and citation impact in social science. Soc. Sci. Quart., 66, 964-75.
•   van Raan, A. F. J. (2006). Performance-related differences of bibliometric statistical properties of research groups: cumulative advantages and hierarchically layered networks. Journal of the American Society for Information Science and Technology, 57(14), 1919-1935.
•   Epstein, D. (2007). Impact factor manipulation. The Write Stuff, 16, 133-134.
•   Andrade, A., González-Jonte, R., Campanario, J. M. (2009). Journals that increase their impact factor at least fourfold in a few years: The role of journal self-citations. Scientometrics, 80(2), 517-530.
•   Vinkler, P. (2009). The pi-index: a new indicator for assessing scientific impact. Journal of Information Science, 35(5), 602-612.
•   Vinkler, P. (2001). An attempt for defining some basic categories of scientometrics and classifying the indicators of evaluative scientometrics. Scientometrics, 50(3), 539-544.
•   Jacso, P. (2008). Testing the Calculation of a Realistic h-index in Google Scholar, Scopus, and Web of Science for F. W. Lancaster. Library Trends, 56(4), Spring 2008, 784-815.
•   Merton, R. K. (1968). The Matthew Effect in Science. Science, 159(3810), January 1968, 56-63.
•   Beel, J., Gipp, B. (2008). The Potential of Collaborative Document Evaluation for Science. In G. Buchanan, M. Masoodian, and S. J. Cunningham (Eds.), 11th International Conference on Asian Digital Libraries (ICADL'08), Lecture Notes in Computer Science (LNCS), vol. 5362. Heidelberg (Germany): Springer, December 2008, 375-378.
•   Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., Su, Z. (2008). Arnetminer: Extraction and mining of academic social networks. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 990-998. ACM.
•   Weinberg, B. H. (1997). The Earliest Hebrew Citation Indexes. Journal of the American Society for Information Science, 48(4), 318-330.
•   Van Noorden, R. (2010). A profusion of measures. Nature, 465.
•   Egghe, L., Guns, R., Rousseau, R. (2011). Thoughts on Uncitedness: Nobel Laureates and Fields Medalists as Case Studies.
•   MacRoberts, M. H., MacRoberts, B. R. (2011). Problems of Citation Analysis: A Study of Uncited and Seldom-Cited Influences.

Publish or Perish: Towards a Ranking of Scientists using Bibliographic Data Mining

  • 1. Publish or Perish: Towards a Ranking of Scientists using Bibliographic Data Mining Lior Rokach Department of Information Systems Engineering Ben-Gurion University of the Negev
  • 2. About Me Prof. Lior Rokach Department of Information Systems Engineering Faculty of Engineering Sciences Head of the Machine Learning Lab Ben-Gurion University of the Negev Email: liorrk@bgu.ac.il http://www.ise.bgu.ac.il/faculty/liorr/ PhD (2004) from Tel Aviv University
  • 3. Outline: • What is bibliometrics? • Short tutorial on bibiometrics measures • Our methodology: data mining • Task 1: Academic positions • Task 2: AAAI Fellowship • Results • Conclusions
  • 4. Ranking scientists, WHY? • Promotion • Tenure • Grants • Prizes
  • 5. Bibliometrics • “Man is an animal that writes letters” – Attributed to Lewis Carroll (Charles Dodgson) • Scientist is an animal that writes papers • Bibliometrics is measurement of (scientific) publications • The simplest measure – Number of publications - Disadvantage: counts Quantity and disregards Quality
  • 6. Publish or Perish “I don‟t mind your thinking slowly. I mind your publishing faster than you can think.” (The Nobel Laureates physicist Wolfgang Pauli)
  • 7. Metrics: Do metrics matter? • According to Abbott et al. (Nature, 2010): – Department heads says ―No‖ • ―External letters trump everything,‖ – But … • Admit that ―those „qualitative‟ letters of recommendation sometimes bring in quantitative metrics by the back door‖ • Most of the researchers (70%) believe it has an effect
  • 8. Quick Guide To Bibliometrics Measures
  • 9. Citation Index A citation index is an index of citations between publications, allowing the user to easily establish which later documents cite which earlier documents
  • 10. The First Citation Index Cited by The first citation index is attributed to the Hebrew Talmud (see above), Dated th Centaury (Weinberg, 1997), while other refer to Shepard's Citations created in 1873 as the first citation index.
  • 11. Simple Citations-Based Measures to Evaluate Scientists • Total Citations (and its squared root) • Total Citations normalized by number of authors • Mean number of citations per year • Mean number of citations per paper
  • 12. Why citations are not always ideal way to evaluate researchers 'publications • Uncitedness: It is a sobering fact that some 90% of articles that have been published in academic journals are never cited. Even Nobel Laureates have a rather large fraction (10% or more) of uncited publications (Egghe et al., 2011). • But the terms ―uncited‖ or ―seldom cited,‖ they are usually referring to uncited or seldom-cited in the journals monitored by Thomson Reuters and other similar databases, not to all journals, books, and reports; • ―uncited‖ or ―seldom-cited‖ is not a synonym for ―not used.‖ (MacRoberts MacRoberts, 2011) • Expert judgment is the best, and in the last resort the only, criterion of performance,
  • 13. A Brief History of Citation Analysis • 1955: – Eugene Garfield - Linguist – Develop the impact factor. – Founder of the Institute for Scientific Information (ISI) • 1997: – Lee Giles; Kurt D. Bollacker; Steve Lawrence – Crawl and harvest papers on the web – Focus mainly on CS • 2004: – ―Stand on the shoulders of giants‖ – Freely accessible web search engine for scholarly literature • 2005: – Jorge E. Hirsch – Physicist – Develop the h-Index • 2007: – Carl Bergstrom – Biologist – Establish http://eigenfactor.org/ – Use PageRank algorithm to rank journals
  • 14. 1. Impact Factor (Garfield, 1955) • Citation Indexes for Science: A New Dimension in Documentation through Association of Ideas – Garfield, E., Science, 1955, 122, 108-111 • The impact factor for each journal, as used by Thomson Scientific, is the average number of citations received in a given year by the papers the journal published during the two preceding years.
      "The 2007 impact factor for journal ABC" = (number of times articles published in ABC during 2005-2006 were cited in indexed journals during 2007) / (number of "citable" articles published by ABC in 2005 and 2006)
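As a small worked example of the ratio above, the following sketch computes a two-year impact factor. The function name and the numbers are made up for illustration.

```python
def impact_factor(citations_this_year, citable_items_prev_two_years):
    """Two-year impact factor: citations received this year to items published
    in the two preceding years, divided by the number of citable items
    published in those two years."""
    if citable_items_prev_two_years <= 0:
        raise ValueError("the journal must have at least one citable item")
    return citations_this_year / citable_items_prev_two_years

# Hypothetical journal ABC: 300 citations in 2007 to its 120 citable
# articles from 2005-2006.
print(impact_factor(300, 120))
```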
  • 15. Criticisms of the Impact Factor • Subject variation: citation studies should be normalized to take into account variables such as field, discipline etc. • Long tail: the citation count of an individual paper is largely uncorrelated with the impact factor of the journal in which it was published. • Only a limited subset of journals is indexed • Biased toward English-language journals • Short (two-year) snapshot of a journal • Includes self-citations • Some journals unfairly promote their own papers • Journal inclusion criteria are about more than just quality
  • 16. Variations of Impact Factor and more: • Five-year Impact Factor • Cited Half-Life - measures the achievability. The Cited Half-Life of journal J in year X is the number of years after which 50% of the lifetime citations of J's content published in X have been received. • Ranking - Journals are often ranked by impact factor within an appropriate Thomson Reuters subject category. Journals can be categorised in multiple subject categories, which will cause their rank to differ; consequently a rank should always be read in the context of the subject category being used. Other Journal Ranking: • Eigenfactor - an algorithm similar to Google's PageRank – By this approach, journals are considered to be influential if they are cited often by other influential journals. – Removes self-citations – Looks at five years of data
  • 17. 2. H-Index (Hirsch, 2005; Egghe and Rousseau, 2006) • A scientist is said to have Hirsch index h if h of their total, N, papers have at least h citations each, and the other N - h papers have at most h citations each
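The definition can be turned into code directly: sort the citation counts in decreasing order and find the last rank whose paper still has at least that many citations. A minimal sketch (the function name is ours):

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(ranked, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

# Five papers with 10, 8, 5, 4 and 3 citations: four papers have >= 4 citations,
# but there are not five papers with >= 5 citations.
print(h_index([10, 8, 5, 4, 3]))
```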
  • 18. • Using H-Index for Physicists by Hirsch: – 10-12 → tenure decisions – 18 → a full professorship – 15–20 → a fellowship in the American Physical Society – 45 or higher → membership in the United States National Academy of Sciences. • H-Index in IS (Clarke, 2008) – Using Google Scholar
  • 19. h ~ mn (m = gradient, n = number of years) 1. m ~ 1, h = 20 after 20 years: "successful scientists" 2. m ~ 2, h = 40 after 20 years: "outstanding scientists" 3. m ~ 3, h = 60 (20 years) or h = 90 (30 years): "truly unique individuals" • Physics Nobel prizes (last 20 years): median h = 35; 84% had h ≥ 30; 49% had m < 1
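Since h ~ mn, the gradient m can be estimated as h divided by the number of years. A tiny sketch with a hypothetical function name:

```python
def m_quotient(h, years):
    """Gradient m in h ~ m * n: the h-index divided by the number of years n."""
    if years <= 0:
        raise ValueError("years must be positive")
    return h / years

# h = 40 after 20 years gives m = 2, the "outstanding scientists" band on the slide.
print(m_quotient(40, 20))
```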
  • 20. Modified H-Index Metrics: Scientists with the same H-Index (Measure, Ref: Description)
      – Rational H-Index Distance (Ruane and Tol, 2008): It first calculates how many new citations are needed to increase the h-index by one point. Let m denote the additional citations needed. Thus the rational h-index is hD = h + 1 - m/(2h+1).
      – Rational H-Index X (Ruane and Tol, 2008): A researcher has an h-index of h if h is the largest number of papers with at least h citations. However, some researchers may have more than h papers, say n, with at least h citations. Let us define x = n - h. Thus the rational H-Index becomes hX = h + x/(s-h), where s is the total number of publications.
      – e-index (Chun-Ting Zhang, 2009): The square root of the surplus of citations in the h-set beyond h^2, i.e., beyond the theoretical minimum required to obtain an h-index of h. The aim of the e-index is to differentiate between scientists with similar h-indices but different citation patterns.
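The three tie-breaking measures on this slide can be sketched as follows. This is an illustrative reading of the slide's formulas, not the authors' code. In particular, the distance variant assumes that m counts the citations missing for the top h + 1 papers to reach h + 1 citations each, with a missing paper counted as having zero citations.

```python
import math

def h_index(citations):
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(ranked, start=1):
        if c < rank:
            break
        h = rank
    return h

def rational_h_distance(citations):
    """hD = h + 1 - m / (2h + 1), m = citations still needed to reach h + 1."""
    ranked = sorted(citations, reverse=True)
    h = h_index(ranked)
    top = ranked[:h + 1]
    top += [0] * (h + 1 - len(top))  # assumption: pad with an uncited paper
    m = sum(max(0, h + 1 - c) for c in top)
    return h + 1 - m / (2 * h + 1)

def rational_h_x(citations):
    """hX = h + x / (s - h), x = (papers with >= h citations) - h, s = all papers."""
    h = h_index(citations)
    s = len(citations)
    if s == h:
        return float(h)  # no papers outside the h-set, so x = 0
    n = sum(1 for c in citations if c >= h)
    return h + (n - h) / (s - h)

def e_index(citations):
    """Square root of the citations in the h-set beyond h^2."""
    ranked = sorted(citations, reverse=True)
    h = h_index(ranked)
    return math.sqrt(sum(ranked[:h]) - h * h)

cites = [10, 8, 5, 4, 3]
print(rational_h_distance(cites), rational_h_x([5, 4, 4, 4, 4, 1]), e_index(cites))
```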
  • 21. Modified H-Index Metrics: To share the fame in a fair way for multi-authored manuscripts (Measure, Ref: Description)
      – Individual h-index (Batista et al., 2006): It divides the standard h-index by the average number of authors in the articles that contribute to the h-index, in order to reduce the effects of co-authorship.
      – Norm Individual h-index: It first normalizes the number of citations for each paper by dividing the number of citations by the number of authors for that paper, then calculates hI,norm as the h-index of the normalized citation counts. This approach is much more fine-grained than Batista et al.'s; it accounts more accurately for any co-authorship effects that might be present, and it is a better approximation of the per-author impact, which is what the original h-index set out to provide.
      – Schreiber Individual h-index (Schreiber, 2008): Schreiber's method uses fractional paper counts (for example, only one third for three authors) instead of reduced citation counts to account for shared authorship of papers, and then determines the multi-authored hm index based on the resulting effective rank of the papers using undiluted citation counts.
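A minimal sketch of the three co-authorship corrections described on this slide. The function names and the `(citations, n_authors)` input format are assumptions made for illustration.

```python
def h_index(values):
    ranked = sorted(values, reverse=True)
    h = 0
    for rank, v in enumerate(ranked, start=1):
        if v < rank:
            break
        h = rank
    return h

def individual_h(papers):
    """Batista et al.: h divided by the mean number of authors in the h-set."""
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    h = h_index([c for c, _ in ranked])
    if h == 0:
        return 0.0
    mean_authors = sum(a for _, a in ranked[:h]) / h
    return h / mean_authors

def norm_individual_h(papers):
    """h-index of the per-author (citations / authors) counts."""
    return h_index([c / a for c, a in papers])

def schreiber_hm(papers):
    """Each paper adds 1/authors to an effective rank; hm is the largest
    effective rank that does not exceed the paper's undiluted citations."""
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    effective_rank, hm = 0.0, 0.0
    for citations, authors in ranked:
        effective_rank += 1.0 / authors
        if citations < effective_rank:
            break
        hm = effective_rank
    return hm

papers = [(10, 2), (8, 1), (5, 5), (4, 2), (3, 1)]  # (citations, n_authors)
print(individual_h(papers), norm_individual_h(papers), schreiber_hm(papers))
```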
  • 22. Modified H-Index Metrics: Age Adjusted (Measure, Ref: Description)
      – Contemporary h-index (Sidiropoulos et al., 2006): It adds an age-related weighting to each cited article, giving less weight to older articles. The weighting is parametrized; if we use gamma=4 and delta=1, this means that for an article published during the current year, its citations count four times. For an article published 4 years ago, its citations count only once. For an article published 6 years ago, its citations count 4/6 times, and so on.
      – AR-index (Jin, 2007): It is an age-weighted citation rate, where the number of citations to a given paper is divided by the age of that paper. Jin defines the AR-index as the square root of the sum of all age-weighted citation counts over all papers that contribute to the h-index.
      – AWCR: Like the AR-index, but summed over all papers instead. (In particular, it allows younger and as yet less cited papers to contribute to the AWCR, even though they may not yet contribute to the h-index.)
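The age-adjusted measures can be sketched as below. To reproduce the slide's numeric example (weight 4 for the current year, 1 at age 4, 4/6 at age 6), this sketch assumes a paper's age is at least 1, with the current year counted as age 1, and a weight of gamma / age^delta. The AWCR is taken here as the plain sum without a square root. Both conventions are assumptions, as are the names and the `(citations, age)` input format.

```python
import math

def h_index(values):
    ranked = sorted(values, reverse=True)
    h = 0
    for rank, v in enumerate(ranked, start=1):
        if v < rank:
            break
        h = rank
    return h

def contemporary_h(papers, gamma=4.0, delta=1.0):
    """h-index computed on age-weighted scores gamma * citations / age**delta."""
    return h_index([gamma * c / age ** delta for c, age in papers])

def awcr(papers):
    """Age-weighted citation rate summed over all papers."""
    return sum(c / age for c, age in papers)

def ar_index(papers):
    """Square root of the age-weighted citations of the papers in the h-set."""
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    h = h_index([c for c, _ in ranked])
    return math.sqrt(sum(c / age for c, age in ranked[:h]))

papers = [(12, 4), (9, 1), (4, 2), (1, 6)]  # (citations, age in years, age >= 1)
print(contemporary_h(papers), awcr(papers), ar_index(papers))
```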
  • 23. Revised H-Index Metrics: Others (Measure, Ref: Description)
      – AWCRpA: The per-author age-weighted citation rate is similar to the plain AWCR, but is normalized to the number of authors for each paper.
      – g-Index (Leo Egghe, 2006): Given a set of articles ranked in decreasing order of the number of citations that they received, the g-index is the (unique) largest number such that the top g articles received (together) at least g^2 citations. It aims to improve on the h-index by giving more weight to highly-cited articles.
      – Pi-index (Vinkler, 2009): The pi-index is equal to one hundredth of the number of citations obtained by the top square root of the total number of journal papers ('elite set of papers') ranked by decreasing number of citations.
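The g-index and pi-index definitions above translate into short functions. In this sketch the g-index is capped at the number of papers, and the elite set size is the square root of the paper count rounded down; the rounding rule is an assumption, since the slide does not state one.

```python
import math

def g_index(citations):
    """Largest g such that the top g papers together have at least g^2 citations."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(ranked, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g

def pi_index(citations):
    """One hundredth of the citations to the top sqrt(N) papers (the elite set)."""
    ranked = sorted(citations, reverse=True)
    elite_size = int(math.sqrt(len(ranked)))  # assumption: round down
    return 0.01 * sum(ranked[:elite_size])

cites = [10, 8, 5, 4, 3]
print(g_index(cites), pi_index(cites))
```

For the same five papers the h-index is 4 while the g-index is 5, which shows how the g-index rewards the highly cited papers at the top of the list.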
  • 28. Limitations of H-Index • The h-index ignores the importance of the publications – Évariste Galois' h-index is 2, and will remain so forever. – Had Albert Einstein died in early 1906, his h-index would be stuck at 4 or 5, despite his high reputation at that date. • Ignores the context of citations: – Some papers are cited to flesh out the introduction (related work) – Some citations are made in a negative context • Gratuitous authorship
  • 31. Eigenfactor.org Scores • Eigenfactor score: … the higher the better – A measure of the overall value provided by all of the articles published in a given journal in a year; accounts for differences in prestige among citing journals. A measure of the journal's total importance to the scientific community. – Eigenfactor scores are scaled so that the sum of the Eigenfactor scores of all journals listed in Thomson's Journal Citation Reports (JCR) is 100. • Article Influence score: … the higher the better – Article Influence measures the average influence, per article, of the papers in a journal. As such, it is comparable to the Impact Factor. – Article Influence scores are normalized so that the mean article in the entire Thomson Journal Citation Reports (JCR) database has an article influence of 1.00. – Still, it's best to "compare" within subjects. • Cost effectiveness: … the lower the better – price / eigenfactor [2006 data]
  • 32. Other Journal Ranking Efforts… SCImago Journal Rank (SJR) Similar to eigenfactor methods, but based on citations in Scopus – Freely available at scimagojr.com – More journals (~13,500) – More international diversity – Uses PageRank algorithm (like eigenfactor.org) – 3 years of citations; no self-citations – But: Scopus only has citations back to ~1995
  • 35. SCImago Journal Search (Agronomy Journal)
  • 36. A Few Other Journal Ranking Proposals… many would like to use journal usage stats • Usage Factors – Based on journal usage (COUNTER stats [Counting Online Usage of Networked Electronic Resources]) uksg.org/usagefactors/final • Y factor, a combination of both the impact factor and the weighted page rank developed by Google (Bollen et al., 2006) • MESUR: MEtrics from Scholarly Usage of Resources – Uses citations & COUNTER stats http://www.mesur.org/MESUR.html
  • 37. Other Measures for Evaluating Researchers (Tang, et al. 2008) • Uptrend - Nothing can catch people's eyes more than a rising star. Uptrend measures are used to define the rising degree of a researcher. • The information for each of the author's papers includes the publication date and the conference's impact factor. We use the least squares method to fit a curve to the papers published in the most recent N years. Then we use the curve to predict one's score in the next year, which is defined as the Uptrend score.
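The formula on the original slide did not survive in this transcript, so the following is only a sketch of the idea under stated assumptions: the "curve" is taken to be a straight line, the yearly score is taken to be the summed impact factor of that year's papers, and all names are hypothetical. It fits a least-squares line to the recent yearly scores and extrapolates one year ahead.

```python
def uptrend(yearly_scores):
    """yearly_scores: list of (year, score) for the most recent N years.
    Returns the score predicted for the year after the last one by an
    ordinary least-squares straight line (an assumed curve shape)."""
    xs = [year for year, _ in yearly_scores]
    ys = [score for _, score in yearly_scores]
    n = len(xs)
    if n < 2 or len(set(xs)) < 2:
        raise ValueError("need scores for at least two distinct years")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope * (max(xs) + 1) + intercept

# Hypothetical yearly scores that rise by 1.0 per year.
print(uptrend([(2005, 1.0), (2006, 2.0), (2007, 3.0)]))
```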
  • 38. Other Measures for Evaluating Researchers (Tang, et al. 2008) • Activity - People's activity is simply defined based on one's papers published in the last years. We consider the importance of each paper and thus define the activity score as:
  • 39. Other Measures for Evaluating Researchers (Tang, et al. 2008) • Diversity - Generally, an expert's research may include several different research fields. Diversity is defined to quantitatively reflect the degree. In particular, we first use the author-conference-topic model (Tang, et al. 2008) to obtain the research fields for each expert.
  • 40. Other Measures for Evaluating Researchers (Tang, et al. 2008) • Sociability - The score of sociability is basically defined based on how many coauthors an expert has. We define the score as : • where #copaperc denotes the number of papers coauthored between the expert and the coauthor c. In the next step, we will further consider the location, organization, nationality information, and research fields.
  • 42. Bibliometrics Predictive Power • Prediction of Nobel Laureates – The Thomson Reuters Citation Laureates rank among the top 0.1% of researchers in their fields, based on citations of their published papers over the last two decades. – Since 2002, of those named Thomson Reuters Citation Laureates, 12 have gone on to win Nobel Prizes. • Jensen et al. (2009) used measurements to predict which of the CNRS researchers will be promoted: • h index leads to 48% of "correct" promoted scientists • number of citations gives 46% • number of published papers only 42%.
  • 43. Research Questions • Primary Questions: – To what extent do bibliometrics reflect scientists' ranking in CS? – Which single measure is the best predictor? – How should different measures be combined? • Secondary Questions: – Which type of manuscripts should be taken into consideration? – Does Self-Citation really matter? – Which citation index is better?
  • 44. Research Methods • Retrospective analysis of scientists' careers: – Correlating academic positions with bibliometrics values that evolve as time goes by. – AAAI Fellowship • Using Data Mining Techniques for building: – A snapshot classifier for ranking scientists to their academic position. – A decision making model for promoting scientists. – A classifier for deciding who should be awarded the AAAI Fellowship each year. • Comparative analysis
  • 46. ISI Web of Knowledge • Coverage – Most Journals (13,000 journals) – Some Conferences (192,000 conference proceedings) – Almost no Books (5,000 books) – All patents (23 million patents) – 256 subject categories in Science, Social Sciences, and Arts and Humanities, covering the full range of scholarship and research – Many citations (716 million) Only Citations that are fully match are • Accuracy – Very few errors – Very few missing values – No Duplications
  • 47. Google Scholar • Coverage – The largest – Still has limited coverage of pre-1990 publications – It is criticized for including gray literature in its citation counts (Sanderson, 2008) • Accuracy – Missing values – Wrong values – Duplicate entries
  • 48. Why CS? • Variety of sub-fields with different citation patterns (Bioinformatics vs AI). • Different types of important manuscripts (Journal, Conferences, Books, Chapters, Patents, etc). • Evolving field (senior professors completed their PhD in other fields). • We are personally interested in this field
  • 49. Task 1: Nominating Committee
  • 50. Inclusion/Exclusion Criteria 47 Researchers – Researchers from Stanford, MIT, Berkeley and Yale – Completed their PhD after 1970 – Researcher name can be disambiguated – CV: • Promotion years are known • No short-cut in the career. – Total of 724 "research years". • ISI - Total number of items: 50K (2300 written by the targeted researchers). • Google Scholar - Total number of items: 300K
  • 51. H-Index Over Time (for 7 professors) [Figure: ISI h-index (0-18) vs. years from PhD (0-28) for Bejerano, Devadas, Gifford, Goldberg, Hudak, Sudan, Tenenbaum]
  • 52. Citations Over Time (for 7 professors) [Figure: average ISI total citations (0-1000) vs. years from PhD (0-28) for Bejerano, Devadas, Gifford, Goldberg, Hudak, Sudan, Tenenbaum]
  • 53. Evaluation • Procedure: Leave One Researcher Out • Base Classifier – Logistic Regression: $\ln(\text{odds}) = b + w^T x$, $p = \frac{1}{1 + e^{-b - w^T x}}$ • Publication type – All - All – All - Journals – Journals - Journals • Self-Citations: – All – Self-Citation 1 (the target researcher is not one of the authors) – Self-Citations 2 (no overlap between original set of authors and the citing paper)
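The evaluation procedure can be illustrated with a short sketch: the logistic function maps the linear score b + w^T x to a probability, and the leave-one-researcher-out split holds out all records of one researcher at a time, so that no researcher appears in both the training and the test set. The record format and the function names are assumptions, and the model fitting itself is omitted.

```python
import math

def logistic(b, w, x):
    """p = 1 / (1 + exp(-(b + w.x))), the logistic regression probability."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def leave_one_researcher_out(records):
    """records: list of (researcher_id, features, label) tuples.
    Yields (held_out_id, train_records, test_records), one fold per researcher."""
    for held_out in sorted({r[0] for r in records}):
        train = [r for r in records if r[0] != held_out]
        test = [r for r in records if r[0] == held_out]
        yield held_out, train, test

# Hypothetical yearly snapshots: (researcher, [h-index], position)
records = [
    ("A", [3], "Assistant"), ("A", [9], "Associate"),
    ("B", [2], "Post"), ("B", [5], "Assistant"),
    ("C", [15], "Full"),
]
folds = list(leave_one_researcher_out(records))
print(len(folds), logistic(0.0, [1.0], [0.0]))
```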
  • 54. Task 1.1: Ranking Researchers • Rank a researcher into one of the following positions, given only a snapshot of her bibliometrics measures: – Post – Assistant – Associate – Full • Note that we are not aware of the scientist's previous position or seniority. • Default accuracy = 35% [Chart legend: Full, Assistant, Associate, Post]
  • 55. The Ranking Task – Results: Top 10 Measures
      Classification Accuracy | Source | Cited Manuscript Type | Citing Manuscript Type | Self-Citation Level | Measure
      59.95% | ISI | Journal | Journal | 1 | g-Index
      59.30% | ISI | Journal | Journal | 0 | g-Index
      59.30% | ISI | Journal | Journal | 2 | g-Index
      58.65% | ISI | All | Journal | 0 | Norm h-index
      58.65% | ISI | All | Journal | 1 | Norm h-index
      58.65% | ISI | All | Journal | 2 | Norm h-index
      58.00% | ISI | Journal | Journal | 1 | Norm h-index
      57.74% | ISI | Journal | Journal | 0 | Norm h-index
      57.74% | ISI | Journal | Journal | 2 | Norm h-index
      57.48% | Google | Journal | Journal | 2 | Rational H Index X
  • 56. The Ranking Task – Results: Least Predictive Measures
      Classification Accuracy | Source | Cited Manuscript Type | Citing Manuscript Type | Self-Citation Level | Measure
      37.06% | Google | Journal | * | * | # Publications
      37.06% | Google | Journal | * | * | Individual # Publications
      37.19% | Google | Journal | Journal | 0 | Schreiber h-index
      38.10% | Google | All | All | 1 | Individual h-index
      38.10% | Google | All | All | 2 | Individual h-index
      38.10% | ISI | All | All | 1 | Schreiber h-index
      38.23% | ISI | All | All | 0 | Schreiber h-index
      38.23% | ISI | All | All | 2 | Schreiber h-index
      38.75% | ISI | All | Journal | 0 | Schreiber h-index
      38.75% | ISI | All | Journal | 2 | Schreiber h-index
      * Statistical significance has been found
  • 57. Not by bibliometrics alone (Years from PhD): Accuracy = 73.7% !!!
      Actual \ Predicted | Full | Associate | Assistant | Post
      Post | 0 | 0 | 56 | 3
      Assistant | 0 | 36 | 167 | 15
      Associate | 29 | 145 | 31 | 1
      Full | 252 | 31 | 3 | 0
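The reported accuracy can be checked from the confusion matrix on this slide. The sketch below reads each row of numbers as one actual position and each column as a predicted position, which is our reading of the flattened table; under that reading the diagonal sums to 567 of 769 cases.

```python
# Rows: actual position; columns: predicted position (our reading of the slide).
confusion = {
    "Post":      {"Full": 0,   "Associate": 0,   "Assistant": 56,  "Post": 3},
    "Assistant": {"Full": 0,   "Associate": 36,  "Assistant": 167, "Post": 15},
    "Associate": {"Full": 29,  "Associate": 145, "Assistant": 31,  "Post": 1},
    "Full":      {"Full": 252, "Associate": 31,  "Assistant": 3,   "Post": 0},
}

correct = sum(confusion[position][position] for position in confusion)
total = sum(sum(row.values()) for row in confusion.values())
accuracy = correct / total
print(f"{correct}/{total} = {accuracy:.1%}")
```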
  • 58. Task 1.2: Promoting Researchers • Given the researcher's current position and her bibliometrics measures, decide if she should be promoted. • Measure the absolute deviation in years from the actual promotion time.
  • 59. Promotion Decision Task - Results
      Measure | Calculated as | Source | Self-Citations Level | Cited Manuscript Type | Citing Manuscript Type | Assistant | Associate | Full | Average
      Rational H-Index 1 | Absolute Value | Google | 1 | All | Journal | 1.26 | 1.58 | 1.88 | 1.51
      Total Citations | Change from Last Rank | Google | 0 | Journal | All | 1.26 | 1.68 | 1.88 | 1.55
      Total Citations | Change from Last Rank | Google | 2 | Journal | All | 1.26 | 1.68 | 1.88 | 1.55
      Total Citations | Change from Last Rank | Google | 1 | Journal | All | 1.26 | 1.71 | 1.88 | 1.56
      Norm Individual H-Index | Change from Last Rank | Google | | All | Journal | 1.28 | 1.74 | 1.79 | 1.56
      … | … | … | … | … | … | … | … | … | …
      Individual H Index | Change from Last Rank | Google | 1 | Journal | Journal | 1.30 | 2.03 | 2.38 | 1.80
      Contemporary H Index | Absolute Value | Google | 1 | Journal | All | 1.46 | 2.00 | 2.17 | 1.81
      * No statistical significance has been found
      * In about 2% of the cases, our system did not recommend promoting a researcher although the promotion actually took place.
  • 60. Not by bibliometrics alone [Figure: Improvement vs. Rank, bar chart over 1. Assistant, 2. Associate, 3. Full; y-axis from -30.00% to 25.00%]
      Measure | Assistant | Associate | Full | Average
      Rational H-Index 1 | 1.26 | 1.58 | 1.88 | 1.51
      Years from PhD | 1.02 | 1.72 | 2.38 | 1.45
      Promoted to Associate: 6 years from PhD. Promoted to Full: 13 years from PhD.
  • 61. Google Scholar vs. ISI Thomson
  • 62. Google Scholar vs. ISI Thomson
  • 64. Which Manuscripts Should be Taken into Consideration?
  • 65. Which Citing Manuscripts Should be Taken into Consideration?
  • 66. Conclusions – Take 1 • Seniority is a good indicator for promoting scientists in leading USA universities. • Variation in bibliometrics among scientists contributes slightly to the promotion timing. • No significant difference between ISI and Google • Self-citation is not so important • After all, journals are more reliable than other publications.
  • 67. Task 2: And the AAAI Fellowship Goes To
  • 68. AAAI Fellowship Try to determine if and when an AI scientist is qualified to be elected to the AAAI Fellowship Data set: – 92 researchers who won the award from 1995 to 2009 only – 200 randomly selected AI researchers with at least 5 papers in top tier AI Journals/Conferences – Using ISI data. • Google Scholar coming soon
  • 69. Task 2.1 – Leave One Scientist Out
      Criterion | Average Performance
      Not identifying a fellow (False Negative) | 21%
      Wrongly identifying a non-fellow (False Positive) | 8.2%
  • 70. Using a single measure: Fellows H-Index
      Criterion | Average Performance
      Not identifying a fellow (False Negative) | 48%
      Wrongly identifying a non-fellow (False Positive) | 6.1%
  • 71. Task 2.2 – Predicting Next Year Fellows
  • 72. Task 2.2 – Predicting Coming Fellows
  • 73. Rules Example • (TC/A = '(65.7085-inf)') and (TP/A = '(26.084-inf)') and (Ih = '(3.565-inf)') and (CpY = '(13.191-inf)') => FellowWon=TRUE (49.0/5.0) • (Pi = '(0.645-inf)') and (AWCR = '(1.0555-3.6035]') and (TC/A = '(80.875- inf)') => FellowWon=TRUE (29.0/3.0) • (TP = '(7.5-inf)') and (e = '(6.595-inf)') and (TP = '(47.5-inf)') and (AWCR = '(1.0735-3.849]') and (AWCRpA = '(2.1705-inf)') and (SIh = '(0.5-3.5]') => FellowWon=TRUE (18.0/1.0) • …
  • 74. Task 2.3 – Social Network • Based on the idea of the Erdős number • Predict fellowship based on co-authorship with other fellows. • http://academic.research.microsoft.com/VisualExplorer.aspx#1802181&84132
  • 75. Task 2.3
      Criterion | Average Performance
      Not identifying a fellow (False Negative) | 52%
      Wrongly identifying a non-fellow (False Positive) | 6.6%
      +
      Criterion | Average Performance
      Not identifying a fellow (False Negative) | 21%
      Wrongly identifying a non-fellow (False Positive) | 8.2%
      =
      Criterion | Average Performance
      Not identifying a fellow (False Negative) | 16%
      Wrongly identifying a non-fellow (False Positive) | 5.9%
  • 76. Task 2.3 • (Count >= 5) and (CpP >= 7) and (TP/A >= 6.883) => Fellow=TRUE (51.0/3.0) • (TP/A >= 22.944) and (Avg <= 3.266667) and (TP <= 40) => Fellow=TRUE (23.0/3.0) • (Count >= 5) and (e >= 7.071) and (CpP <= 1.618) => Fellow=TRUE (11.0/1.0) • …
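To make the rule list concrete, the sketch below encodes only the three rules shown on this slide and predicts Fellow=TRUE when any of them fires. The slide elides the remaining rules ("…"), so this is not the full model. The feature dictionary, the function name, and the default of FALSE when no rule fires are assumptions, and the meaning of features such as Count and Avg is not spelled out in the slides.

```python
# Only the three rules visible on the slide; the full rule list is elided there.
RULES = [
    lambda f: f["Count"] >= 5 and f["CpP"] >= 7 and f["TP/A"] >= 6.883,
    lambda f: f["TP/A"] >= 22.944 and f["Avg"] <= 3.266667 and f["TP"] <= 40,
    lambda f: f["Count"] >= 5 and f["e"] >= 7.071 and f["CpP"] <= 1.618,
]

def predict_fellow(features):
    """True if any listed rule fires; False otherwise (assumed default)."""
    return any(rule(features) for rule in RULES)

# Hypothetical researchers described by the features used in the rules.
candidate = {"Count": 6, "CpP": 8.0, "TP/A": 7.0, "Avg": 5.0, "TP": 50, "e": 1.0}
other = {"Count": 1, "CpP": 2.0, "TP/A": 3.0, "Avg": 5.0, "TP": 10, "e": 1.0}
print(predict_fellow(candidate), predict_fellow(other))
```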
  • 77. Conclusions – Take 2 • Bibliometric measures can be used to predict fellowship • Combining various measures using data mining techniques improves prediction power • Co-authorship relations can slightly boost the accuracy
  • 79. Very Near Future Work • Adding the Google Scholar dataset • Examining the contribution of conferences in predicting the fellowship. • Tell Me Who Cites You, …
  • 80. Why God Never Received Tenure at Any University 1) He had only one major publication. 2) It was in Hebrew. 3) It had no references. 4) It wasn't published in a refereed journal. 5) Some even doubt he wrote it himself. 6) It may be true that he created the world, but what has he done since then? 7) His cooperative efforts have been quite limited. 8) The scientific community has had a hard time replicating his results. 9) He never applied to the Ethics Board for permission to use human subjects. 10) When an experiment went awry, he tried to cover it up by drowning the subjects. 11) When subjects didn't behave as predicted, he deleted them from the sample. 12) He rarely came to class, just told students to read the book. 13) Some say he had his son teach the class. 14) He expelled his first two students for learning. 15) Although there were only ten requirements, most students failed his tests. 16) His office hours were infrequent and usually held on a mountaintop.
  • 81. References
      • Johan Bollen, Marko A. Rodriguez, Herbert Van de Sompel, Journal status, Scientometrics, Vol. 69, No. 3 (2006) 669-687
      • Christenson J A, Sigelman L. Accrediting knowledge: Journal stature and citation impact in social science. Soc. Sci. Quart. 66:964-75, 1985.
      • Van Raan, A. F. J. (2006), Performance-related differences of bibliometric statistical properties of research groups: cumulative advantages and hierarchically layered networks, Journal of the American Society for Information Science and Technology, 57 (14): 1919-1935.
      • Epstein, D. (2007), Impact factor manipulation. The Write Stuff, 16: 133-134.
      • Antonia Andrade, Raúl González-Jonte, Juan Miguel Campanario, Journals that increase their impact factor at least fourfold in a few years: The role of journal self-citations, Scientometrics, Vol. 80, No. 2 (2009) 517-530
      • Peter Vinkler, The pi-index: a new indicator for assessing scientific impact, Journal of Information Science, Vol. 35, No. 5, 602-612 (2009)
      • Peter Vinkler, An attempt for defining some basic categories of scientometrics and classifying the indicators of evaluative scientometrics, Scientometrics, Vol. 50, No. 3 (2001) 539-544
      • Peter Jacso, Testing the Calculation of a Realistic h-index in Google Scholar, Scopus, and Web of Science for F. W. Lancaster, Library Trends, Vol. 56, No. 4, Spring 2008, pp. 784-815
      • R. K. Merton, "The Matthew Effect in Science," Science, vol. 159, no. 3810, pp. 56-63, January 1968.
      • J. Beel and B. Gipp, "The Potential of Collaborative Document Evaluation for Science," in 11th International Conference on Asian Digital Libraries (ICADL'08), ser. Lecture Notes in Computer Science (LNCS), G. Buchanan, M. Masoodian, and S. J. Cunningham, Eds., vol. 5362. Heidelberg (Germany): Springer, December 2008, pp. 375-378.
      • Tang, J. and Zhang, J. and Yao, L. and Li, J. and Zhang, L. and Su, Z., Arnetminer: Extraction and mining of academic social networks, Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 990-998, 2008, ACM.
      • B. H. Weinberg, The Earliest Hebrew Citation Indexes, Journal of the American Society for Information Science, 48(4): 318-330, 1997
      • Richard Van Noorden (2010), A profusion of measures, Nature, Vol. 465
      • Leo Egghe, Raf Guns, Ronald Rousseau (2011), Thoughts on Uncitedness: Nobel Laureates and Fields Medalists as Case Studies
      • M. H. MacRoberts and B. R. MacRoberts, Problems of Citation Analysis: A Study of Uncited and Seldom-Cited Influences (2011)