Publish or Perish: Towards a Ranking of Scientists using Bibliographic Data Mining
1. Publish or Perish:
Towards a Ranking of Scientists using
Bibliographic Data Mining
Lior Rokach
Department of Information Systems Engineering
Ben-Gurion University of the Negev
2. About Me
Prof. Lior Rokach
Department of Information Systems Engineering
Faculty of Engineering Sciences
Head of the Machine Learning Lab
Ben-Gurion University of the Negev
Email: liorrk@bgu.ac.il
http://www.ise.bgu.ac.il/faculty/liorr/
PhD (2004) from Tel Aviv University
3. Outline:
• What is bibliometrics?
• Short tutorial on bibliometric measures
• Our methodology: data mining
• Task 1: Academic positions
• Task 2: AAAI Fellowship
• Results
• Conclusions
5. Bibliometrics
• “Man is an animal that writes letters”
– Attributed to Lewis Carroll (Charles Dodgson)
• A scientist is an animal that writes papers
• Bibliometrics is the measurement of (scientific) publications
• The simplest measure: number of publications
– Disadvantage: counts quantity and disregards quality
6. Publish or Perish
“I don't mind your thinking slowly. I mind your
publishing faster than you can think.”
(Nobel laureate physicist Wolfgang Pauli)
7. Metrics: Do metrics matter?
• According to Abbott et al.
(Nature, 2010):
– Department heads say “No”
• “External letters trump everything,”
– But …
• They admit that “those ‘qualitative’ letters
of recommendation sometimes bring in
quantitative metrics by the back door”
• Most researchers (70%) believe
metrics have an effect
9. Citation Index
A citation index is an index of citations
between publications, allowing the user to
easily establish which later documents cite
which earlier documents
10. The First Citation Index
Cited by
The first citation index is attributed to the Hebrew Talmud,
dated to the …th century (Weinberg, 1997), while others regard Shepard's
Citations, created in 1873, as the first citation index.
11. Simple Citations-Based Measures
to Evaluate Scientists
• Total Citations (and its square root)
• Total Citations normalized by number of authors
• Mean number of citations per year
• Mean number of citations per paper
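A minimal sketch of the measures listed above, computed from a list of papers. The `Paper` record, its field names, the per-paper division by author count, and the inclusive career length are assumptions made for illustration; the slides do not fix these details.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Paper:
    year: int        # publication year
    citations: int   # citations received so far
    n_authors: int   # number of authors on the paper


def simple_measures(papers, current_year):
    """Return simple citation-based measures for one scientist."""
    total = sum(p.citations for p in papers)
    # Career length counted inclusively from the first paper (assumption).
    career_years = current_year - min(p.year for p in papers) + 1
    return {
        "total_citations": total,
        "sqrt_total_citations": sqrt(total),
        # Each paper's citations are divided by its own author count (assumption).
        "citations_per_author": sum(p.citations / p.n_authors for p in papers),
        "citations_per_year": total / career_years,
        "citations_per_paper": total / len(papers),
    }


# Hypothetical publication record.
papers = [Paper(2005, 40, 2), Paper(2007, 10, 1), Paper(2009, 0, 5)]
measures = simple_measures(papers, current_year=2009)
```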
12. Why citations are not always an ideal way
to evaluate researchers' publications
• Uncitedness: It is a sobering fact that some 90% of articles that have
been published in academic journals are never cited. Even Nobel
laureates have a rather large fraction (10% or more) of uncited
publications (Egghe et al., 2011).
• But when the terms “uncited” or “seldom cited” are used, they usually refer
to papers that are uncited or seldom cited in the journals monitored by Thomson
Reuters and other similar databases, not in all journals, books, and
reports.
• “Uncited” or “seldom cited” is not a synonym for “not used”
(MacRoberts and MacRoberts, 2011).
• Expert judgment is the best, and in the last resort the only, criterion of
performance.
13. A Brief History of Citation Analysis
• 1955:
– Eugene Garfield, linguist
– Develops the impact factor
– Founder of the Institute for Scientific Information (ISI)
• 1997:
– CiteSeer: Lee Giles, Kurt D. Bollacker, Steve Lawrence
– Crawls and harvests papers on the web
– Focuses mainly on CS
• 2004:
– Google Scholar: “Stand on the shoulders of giants”
– Freely accessible web search engine for scholarly literature
• 2005:
– Jorge E. Hirsch, physicist
– Develops the h-index
• 2007:
– Carl Bergstrom, biologist
– Establishes http://eigenfactor.org/
– Uses the PageRank algorithm to rank journals
14. 1. Impact Factor (Garfield, 1955)
• Citation Indexes for Science: A New Dimension in
Documentation through Association of Ideas
– Garfield, E., Science, 1955, 122, 108-111
• The impact factor of a journal for a given year, as used by Thomson
Scientific, is the average number of citations received in that year by
the papers the journal published during the two preceding years.

$$\mathrm{IF}_{2007}(\mathrm{ABC}) = \frac{\text{number of times articles published in ABC during 2005-2006 were cited in indexed journals during 2007}}{\text{number of “citable” articles published by ABC in 2005 and 2006}}$$
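The ratio above can be sketched in a few lines. The function name, the dictionary layout, and the counts for journal "ABC" are hypothetical and only illustrate the arithmetic.

```python
def impact_factor(citations_in_year, citable_items, year):
    """Two-year impact factor for `year`.

    citations_in_year: publication year -> citations received during `year`
        by items the journal published in that publication year.
    citable_items: publication year -> number of citable items published.
    """
    window = (year - 1, year - 2)
    cites = sum(citations_in_year.get(y, 0) for y in window)
    items = sum(citable_items.get(y, 0) for y in window)
    if items == 0:
        raise ValueError("no citable items in the two-year window")
    return cites / items


# Hypothetical counts: 200 citations in 2007 to 100 citable items from 2005-2006.
if_2007 = impact_factor({2005: 120, 2006: 80}, {2005: 60, 2006: 40}, 2007)
```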
15. Criticisms of the Impact Factor
• Subject variation: citation studies should be normalized to
take into account variables such as field, discipline etc.
• Long tail: the citation count of an individual paper is largely uncorrelated with the
impact factor of the journal in which it was published.
• Only a limited subset of journals is indexed
• Biased toward English-language journals
• Short (two year) snapshot of journal
• Includes self-citations
• Some journals are unfairly promoting their own papers
• Journal Inclusion Criteria are more than just quality
16. Variations of Impact Factor and more:
• Five-year Impact Factor
• Cited Half-Life - measures the achievability. The Cited Half-Life of journal J in year
X is the number of years after which 50% of the lifetime citations of J's content published in
X have been received.
• Ranking - Journals are often ranked by impact factor within an appropriate Thomson Reuters
subject category. A journal can be assigned to multiple subject categories, so its rank can
differ between them; a rank should therefore always be quoted together with the subject
category used.
Other Journal Rankings:
• Eigenfactor - uses an algorithm similar to Google's PageRank
– By this approach, journals are considered to be influential if they are cited often by other
influential journals.
– Removes self-citations
– Looks at five years of data
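To illustrate the idea that a journal is influential when influential journals cite it, here is a plain PageRank-style power iteration over a journal-to-journal citation table. It is a simplified sketch and not the Eigenfactor algorithm itself; the damping value, the handling of journals without outgoing citations, and the journal names are assumptions.

```python
def journal_influence(cites, damping=0.85, iterations=100):
    """cites[a][b] = number of citations from journal a to journal b."""
    journals = sorted(set(cites) | {b for row in cites.values() for b in row})
    n = len(journals)
    score = {j: 1.0 / n for j in journals}
    for _ in range(iterations):
        nxt = {j: (1.0 - damping) / n for j in journals}
        for a in journals:
            # Drop self-citations before distributing a's influence.
            out = {b: c for b, c in cites.get(a, {}).items() if b != a and c > 0}
            total = sum(out.values())
            if total == 0:
                # Journal with no outgoing citations: spread its score evenly.
                for j in journals:
                    nxt[j] += damping * score[a] / n
            else:
                for b, c in out.items():
                    nxt[b] += damping * score[a] * c / total
        score = nxt
    return score


# Hypothetical citation counts between three journals.
cites = {"A": {"A": 50, "B": 10}, "B": {"A": 5, "C": 5}, "C": {"A": 1}}
scores = journal_influence(cites)
```

Because self-citations are dropped, the 50 citations from journal A to itself do not change its score.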
17. 2. H-Index
(Hirsch, 2005; Egghe and Rousseau, 2006)
• A scientist is said to have Hirsch index h if h of their N papers
have at least h citations each, and the other N - h papers have
at most h citations each
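A short sketch of the h-index computation from a list of per-paper citation counts. The function name and the example counts are illustrative.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical record: four papers have at least 4 citations, but not five with 5.
example_h = h_index([10, 8, 5, 4, 3])
```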
18. • Using the h-index for physicists, according to Hirsch:
– 10-12: tenure decisions
– 18: a full professorship
– 15–20: a fellowship in the American
Physical Society
– 45 or higher: membership in the United
States National Academy of Sciences
• H-Index in IS (Clarke, 2008)
– Using Google Scholar
19. h ~ mn
(m = gradient, n = number of years)
1. m ~ 1, h = 20 after 20 years: “successful scientists”
2. m ~ 2, h = 40 after 20 years: “outstanding scientists”
3. m ~ 3, h = 60 (20 years) or h = 90 (30 years): “truly unique
individuals”
Physics Nobel prizes (last 20 years)
h (median) = 35
84% had h ≥ 30
49% had m < 1
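As a small worked example of h ~ mn, the gradient m can be recovered by dividing the h-index by the number of years. The function name is hypothetical; the three cases reuse the numbers from the slide.

```python
def m_quotient(h, years_active):
    """Gradient m in h ~ m*n: the h-index divided by the number of years n."""
    if years_active <= 0:
        raise ValueError("years_active must be positive")
    return h / years_active


m_successful = m_quotient(20, 20)   # h = 20 after 20 years
m_outstanding = m_quotient(40, 20)  # h = 40 after 20 years
m_unique = m_quotient(90, 30)       # h = 90 after 30 years
```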
20. Modified H-Index Metrics
Scientists with the same H-Index
• Rational H-Index Distance (Ruane and Tol, 2008): First calculate how many new citations are needed to increase the h-index by one point, and let m denote that number. The rational h-index is then hD = h + 1 - m/(2h+1).
• Rational H-Index X (Ruane and Tol, 2008): A researcher has an h-index of h if h is the largest number of papers with at least h citations. However, some researchers may have more than h papers, say n, with at least h citations. Define x = n - h. The rational h-index then becomes hX = h + x/(s-h), where s is the total number of publications.
• e-index (Chun-Ting Zhang, 2009): The square root of the surplus of citations in the h-set beyond h^2, i.e., beyond the theoretical minimum required to obtain an h-index of h. The aim of the e-index is to differentiate between scientists with similar h-indices but different citation patterns.
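The three measures above can be sketched as follows. Treating a missing (h+1)-th paper as having zero citations, and returning h when every paper is already in the h-set, are assumptions made to keep the sketch well defined.

```python
from math import sqrt


def h_index(citations):
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)


def rational_h_distance(citations):
    """hD = h + 1 - m/(2h+1), m = citations still needed to reach h+1."""
    h = h_index(citations)
    ranked = sorted(citations, reverse=True)
    # Top h+1 papers; a missing paper counts as zero citations (assumption).
    top = ranked[: h + 1] + [0] * max(0, h + 1 - len(ranked))
    m = sum(max(0, (h + 1) - c) for c in top)
    return h + 1 - m / (2 * h + 1)


def rational_h_x(citations):
    """hX = h + x/(s-h), x = n - h, n = papers with at least h citations."""
    h = h_index(citations)
    s = len(citations)
    if s == h:
        return float(h)  # no papers outside the h-set (assumption)
    n = sum(1 for c in citations if c >= h)
    return h + (n - h) / (s - h)


def e_index(citations):
    """Square root of the h-set citations in excess of h^2."""
    h = h_index(citations)
    ranked = sorted(citations, reverse=True)
    return sqrt(sum(ranked[:h]) - h * h)


record = [10, 8, 5, 4, 3]  # hypothetical citation counts, h = 4
```

For this record, three more citations are needed to reach h = 5, so hD = 5 - 3/9.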
21. Modified H-Index Metrics
To share the fame in a fair way: multi-authored manuscripts
• Individual h-index (Batista et al., 2006): Divides the standard h-index by the average number of authors in the articles that contribute to the h-index, in order to reduce the effects of co-authorship.
• Norm Individual h-index: First normalizes the number of citations for each paper by dividing it by the number of authors of that paper, then calculates hI,norm as the h-index of the normalized citation counts. This approach is much more fine-grained than Batista et al.'s; it accounts more accurately for any co-authorship effects that might be present and is a better approximation of the per-author impact, which is what the original h-index set out to provide.
• Schreiber Individual h-index (Schreiber, 2008): Uses fractional paper counts (for example, only one third for three authors) instead of reduced citation counts to account for shared authorship of papers, and then determines the multi-authored hm index based on the resulting effective rank of the papers using undiluted citation counts.
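A sketch of the three co-authorship corrections described above, with each paper given as a (citations, number of authors) pair. The function names and the sample record are illustrative; ties between equally cited papers are broken by input order, which is an assumption.

```python
def h_index(values):
    ranked = sorted(values, reverse=True)
    return sum(1 for rank, v in enumerate(ranked, start=1) if v >= rank)


def individual_h(papers):
    """Batista et al.: h divided by the mean author count of the h-set."""
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    h = h_index([c for c, _ in papers])
    if h == 0:
        return 0.0
    mean_authors = sum(a for _, a in ranked[:h]) / h
    return h / mean_authors


def norm_individual_h(papers):
    """h-index of the per-author normalized citation counts."""
    return h_index([c / a for c, a in papers])


def schreiber_hm(papers):
    """Largest effective rank r_eff with citations >= r_eff.

    Each paper adds 1/authors to the effective rank; citation counts
    themselves are left undiluted.
    """
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    r_eff = 0.0
    hm = 0.0
    for cites, authors in ranked:
        r_eff += 1.0 / authors
        if cites >= r_eff:
            hm = r_eff
        else:
            break
    return hm


# Hypothetical record: (citations, number of authors).
papers = [(10, 2), (8, 1), (5, 5), (4, 2), (3, 1)]
```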
22. Modified H-Index Metrics
Age Adjusted
• Contemporary h-index (Sidiropoulos et al., 2006): Adds an age-related weighting to each cited article, giving less weight to older articles. The weighting is parametrized; if we use gamma=4 and delta=1, this means that for an article published during the current year, its citations count four times. For an article published 4 years ago, its citations count only once. For an article published 6 years ago, its citations count 4/6 times, and so on.
• AR-index (Jin, 2007): An age-weighted citation rate, where the number of citations to a given paper is divided by the age of that paper. Jin defines the AR-index as the square root of the sum of all age-weighted citation counts over all papers that contribute to the h-index.
• AWCR: Like the AR-index, but sums over all papers instead. In particular, it allows younger and as yet less cited papers to contribute to the AWCR, even though they may not yet contribute to the h-index.
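A minimal sketch of the AWCR and AR-index as described above, with each paper given as a (publication year, citations) pair. Counting a paper's age as `current_year - year + 1`, so that a paper from the current year has age 1, is an assumption; the AWCR is taken here as the plain sum over all papers, as in the description above.

```python
from math import sqrt


def h_index(values):
    ranked = sorted(values, reverse=True)
    return sum(1 for rank, v in enumerate(ranked, start=1) if v >= rank)


def age_weighted(paper, current_year):
    year, citations = paper
    age = current_year - year + 1  # current-year papers have age 1 (assumption)
    return citations / age


def awcr(papers, current_year):
    """Sum of age-weighted citation counts over all papers."""
    return sum(age_weighted(p, current_year) for p in papers)


def ar_index(papers, current_year):
    """Square root of the age-weighted sum over the h-set only."""
    ranked = sorted(papers, key=lambda p: p[1], reverse=True)
    h = h_index([c for _, c in papers])
    return sqrt(sum(age_weighted(p, current_year) for p in ranked[:h]))


# Hypothetical record: (publication year, citations); h = 3.
papers = [(2001, 20), (2006, 10), (2010, 3), (2009, 1)]
```

In this record the 2009 paper lies outside the h-set, so it contributes to the AWCR but not to the AR-index.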
23. Revised H-Index Metrics
Others
• AWCRpA: The per-author age-weighted citation rate is similar to the plain AWCR, but is normalized to the number of authors for each paper.
• g-Index (Leo Egghe, 2006): Given a set of articles ranked in decreasing order of the number of citations that they received, the g-index is the (unique) largest number such that the top g articles received (together) at least g^2 citations. It aims to improve on the h-index by giving more weight to highly-cited articles.
• Pi-index (Vinkler, 2009): The pi-index is equal to one hundredth of the number of citations received by the “elite set” of papers, i.e., the top square root of the total number of journal papers, ranked by decreasing number of citations.
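The g-index and pi-index above can be sketched as below. Capping g at the number of papers and rounding the square root to the nearest integer for the elite set are assumptions; the slide does not specify them.

```python
from math import sqrt


def g_index(citations):
    """Largest g such that the top g papers together have >= g^2 citations."""
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, cites in enumerate(ranked, start=1):
        total += cites
        if total >= rank * rank:
            g = rank
    return g


def pi_index(citations):
    """One hundredth of the citations to the top sqrt(P) papers."""
    ranked = sorted(citations, reverse=True)
    elite_size = round(sqrt(len(ranked)))  # rounding rule is an assumption
    return sum(ranked[:elite_size]) / 100


# Hypothetical record of nine papers: the elite set is the top three.
record = [50, 30, 20, 10, 5, 4, 3, 2, 1]
```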
28. Limitations of H-Index
• The h-index ignores the importance of the publications
– Évariste Galois' h-index is 2, and will remain so forever.
– Had Albert Einstein died in early 1906, his h-index would be
stuck at 4 or 5, despite his high reputation at that date.
• Ignores the context of citations:
– Some papers are cited to flesh out the introduction (related
work)
– Some citations are made in a negative context
• Gratuitous authorship
31. Eigenfactor.org Scores
• Eigenfactor score: … the higher the better
– A measure of the overall value provided by all of the articles published
in a given journal in a year; accounts for differences in prestige among
citing journals. A measure of the journal's total importance to the
scientific community.
– Eigenfactor scores are scaled so that the sum of the Eigenfactor scores
of all journals listed in Thomson's Journal Citation Reports (JCR) is
100.
• Article Influence score: … the higher the better
– Article Influence measures the average influence, per article, of the
papers in a journal. As such, it is comparable to the Impact Factor.
– Article Influence scores are normalized so that the mean article in the
entire Thomson Journal Citation Reports (JCR) database has an article
influence of 1.00.
– Still, it's best to “compare” within subjects.
• Cost effectiveness: … the lower the better
– price / eigenfactor [2006 data]
32. Other Journal Ranking Efforts…
SCImago Journal Rank (SJR)
Similar to eigenfactor methods, but based on
citations in Scopus
– Freely available at scimagojr.com
– More journals (~13,500)
– More international diversity
– Uses PageRank algorithm (like eigenfactor.org)
– 3 years of citations; no self-citations
– But: Scopus only has citations back to ~1995
36. A Few Other Journal Ranking
Proposals… many would like to use
journal usage stats
• Usage Factors – Based on journal usage
(COUNTER stats [Counting Online Usage of
Networked Electronic Resources]) uksg.org/usagefactors/final
• Y factor, a combination of both the impact
factor and the weighted page rank developed
by Google (Bollen et al., 2006)
• MESUR: MEtrics from Scholarly Usage of
Resources – Uses citations & COUNTER
stats
http://www.mesur.org/MESUR.html
37. Other Measures for Evaluating
Researchers (Tang, et al. 2008)
• Uptrend - Nothing catches people's eyes more than a rising star.
Uptrend measures quantify how fast a researcher is rising.
• The input is, for each of the author's papers, the publication date
and the conference's impact factor. The Least Squares Method is used to fit a
curve to the papers published in the recent N years. The curve is then used
to predict the researcher's score in the next year, which is defined as the
Uptrend score.
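A rough sketch of the Uptrend idea: fit a least-squares curve to recent yearly scores and extrapolate one year ahead. The straight-line model, the function name, and the aggregation of papers into one score per year (for instance, the summed impact factors of that year's venues) are assumptions; the exact formula of Tang et al. is not reproduced here.

```python
def uptrend(yearly_scores, next_year):
    """Fit score = intercept + slope * year by least squares, then extrapolate.

    yearly_scores: year -> score for each of the recent N years.
    """
    years = sorted(yearly_scores)
    n = len(years)
    if n < 2:
        raise ValueError("need at least two years to fit a line")
    mean_x = sum(years) / n
    mean_y = sum(yearly_scores[y] for y in years) / n
    sxx = sum((y - mean_x) ** 2 for y in years)
    sxy = sum((y - mean_x) * (yearly_scores[y] - mean_y) for y in years)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * next_year


# Hypothetical yearly scores rising by one unit per year.
predicted = uptrend({2005: 1.0, 2006: 2.0, 2007: 3.0}, next_year=2008)
```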
38. Other Measures for Evaluating
Researchers (Tang, et al. 2008)
• Activity - People's activity is simply defined based
on one's papers published in the last years. We
consider the importance of each paper and thus
define the activity score as:
39. Other Measures for Evaluating
Researchers (Tang, et al. 2008)
• Diversity - Generally, an expert's research may
span several different research fields. Diversity
quantifies the degree of this spread. In
particular, we first use the author-conference-topic
model (Tang, et al. 2008) to obtain the research
fields for each expert.
40. Other Measures for Evaluating
Researchers (Tang, et al. 2008)
• Sociability - The sociability score is based on
how many coauthors an expert
has. We define the score as:
• where #copaper_c denotes the number of papers
coauthored between the expert and the coauthor c. In
the next step, we will further consider the location,
organization, nationality information, and research
fields.
42. Bibliometrics Predictive Power
• Prediction of Nobel laureates:
– Thomson Reuters Citation Laureates rank among the top 0.1% of
researchers in their fields, based on citations of their
published papers over the last two decades.
– Since 2002, of those named Thomson Reuters Citation
Laureates, 12 have gone on to win Nobel Prizes.
• Jensen et al. (2009) used bibliometric measures to predict
which of the CNRS researchers would be promoted:
– h index leads to 48% of “correct” promoted scientists
– number of citations gives 46%
– number of published papers only 42%.
43. Research Questions
• Primary Questions:
– To what extent do bibliometrics reflect scientists'
ranking in CS?
– Which single measure is the best predictor?
– How should different measures be combined?
• Secondary Questions:
– Which types of manuscripts should be taken into
consideration?
– Does self-citation really matter?
– Which citation index is better?
44. Research Methods
• Retrospective analysis of scientists' careers:
– Correlating academic positions with bibliometric
values that evolve as time goes by.
– AAAI Fellowship
• Using data mining techniques for building:
– A snapshot classifier for assigning scientists to their
academic position.
– A decision-making model for promoting scientists.
– A classifier for deciding who should be awarded the
AAAI Fellowship each year.
• Comparative analysis
46. ISI Web of Knowledge
• Coverage
– Most Journals (13,000 journals)
– Some Conferences (192,000 conference proceedings)
– Almost no Books (5,000 books)
– All patents (23 million patents)
– 256 subject categories in Science, Social Sciences, and Arts and Humanities,
covering the full range of scholarship and research
– Many citations (716 million); only citations that fully match are …
• Accuracy
– Very few errors
– Very few missing values
– No Duplications
47. Google Scholar
• Coverage
– The largest
– Still has limited coverage of pre-1990 publications
– It is criticized for including gray literature in its citation
counts (Sanderson, 2008)
• Accuracy
– Missing values
– Wrong values
– Duplicate entries
48. Why CS?
• Variety of sub-fields with different citation patterns
(Bioinformatics vs AI).
• Different types of important manuscripts (Journal,
Conferences, Books, Chapters, Patents, etc).
• Evolving field (senior professors completed their PhD in
other fields).
• We are personally interested in this field
50. Inclusion/Exclusion Criteria
47 Researchers
– Researchers from Stanford, MIT, Berkeley and Yale
– Completed their PhD after 1970
– Researcher name can be disambiguated
– CV:
• Promotion years are known
• No shortcuts in the career
– Total of 724 “research years”
• ISI - Total number of items: 50K (2300 written
by the targeted researchers)
• Google Scholar - Total number of items: 300K
51. H-Index Over Time (for 7 professors)
[Figure: ISI h-index (y-axis, 0 to 18) versus years from PhD (x-axis, 0 to 28) for Bejerano, Devadas, Gifford, Goldberg, Hudak, Sudan, and Tenenbaum]
52. Citations Over Time (for 7 professors)
[Figure: average total ISI citations (y-axis, 0 to 1000) versus years from PhD (x-axis, 0 to 28) for Bejerano, Devadas, Gifford, Goldberg, Hudak, Sudan, and Tenenbaum]
53. Evaluation
• Procedure: Leave One Researcher Out
• Base classifier – Logistic Regression:

$$\ln(\text{odds}) = b + w^{T}x, \qquad p = \frac{1}{1 + e^{-b - w^{T}x}}$$

• Publication type (cited – citing)
– All – All
– All – Journals
– Journals – Journals
• Self-citations:
– All
– Self-Citation 1 (the target researcher is not one of the authors)
– Self-Citation 2 (no overlap between original set of authors
and the citing paper)
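A minimal sketch of the two ingredients of the evaluation above: the logistic-regression probability and the leave-one-researcher-out splits. The `(researcher, features, label)` tuple layout and the function names are assumptions; fitting the weights b and w is left out.

```python
from math import exp


def predict_probability(b, w, x):
    """p = 1 / (1 + exp(-(b + w.x))) for bias b, weights w, features x."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + exp(-z))


def leave_one_researcher_out(samples):
    """Yield (held_out, train, test) so that no researcher is in both sets.

    samples: list of (researcher, features, label) tuples, typically one
    per researcher and year.
    """
    researchers = sorted({s[0] for s in samples})
    for held_out in researchers:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


# Hypothetical samples for three researchers.
samples = [("A", [1.0], 0), ("A", [2.0], 1), ("B", [3.0], 1), ("C", [0.5], 0)]
splits = list(leave_one_researcher_out(samples))
```

Holding out all samples of one researcher at a time keeps that researcher's other career years out of the training data.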
54. Task 1.1: Ranking Researchers
• Assign a researcher to one of the following positions,
given only a snapshot of her bibliometric
measures:
– Post
– Assistant
– Associate
– Full
• Note that we are not given the scientist's previous
position or seniority.
• Default accuracy = 35%
55. The Ranking Task – Results
Top 10 Measures
Classification accuracy  Source  Cited manuscript type  Citing manuscript type  Self-citation level  Measure
59.95%                   ISI     Journal                Journal                 1                    g-Index
59.30%                   ISI     Journal                Journal                 0                    g-Index
59.30%                   ISI     Journal                Journal                 2                    g-Index
58.65%                   ISI     All                    Journal                 0                    Norm h-index
58.65%                   ISI     All                    Journal                 1                    Norm h-index
58.65%                   ISI     All                    Journal                 2                    Norm h-index
58.00%                   ISI     Journal                Journal                 1                    Norm h-index
57.74%                   ISI     Journal                Journal                 0                    Norm h-index
57.74%                   ISI     Journal                Journal                 2                    Norm h-index
57.48%                   Google  Journal                Journal                 2                    Rational H Index X
56. The Ranking Task – Results
Least Predictive Measures
Classification accuracy  Source  Cited manuscript type  Citing manuscript type  Self-citation level  Measure
37.06%                   Google  Journal                *                       *                    # Publications
37.06%                   Google  Journal                *                       *                    Individual # Publications
37.19%                   Google  Journal                Journal                 0                    Schreiber h-index
38.10%                   Google  All                    All                     1                    Individual h-index
38.10%                   Google  All                    All                     2                    Individual h-index
38.10%                   ISI     All                    All                     1                    Schreiber h-index
38.23%                   ISI     All                    All                     0                    Schreiber h-index
38.23%                   ISI     All                    All                     2                    Schreiber h-index
38.75%                   ISI     All                    Journal                 0                    Schreiber h-index
38.75%                   ISI     All                    Journal                 2                    Schreiber h-index
* Statistical significance has been found
57. Not by bibliometrics alone
Years from PhD: Accuracy = 73.7% !!!
Confusion matrix (rows: actual position, columns: predicted position)
Actual \ Predicted  Full  Associate  Assistant  Post
Post                0     0          56         3
Assistant           0     36         167        15
Associate           29    145        31         1
Full                252   31         3          0
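As a check on the reported figure, the accuracy can be recomputed from the confusion matrix on this slide: the diagonal holds the correctly classified cases. The dictionary layout and function names are illustrative; the counts are those shown above.

```python
# confusion[actual][predicted] = number of cases, copied from the slide.
confusion = {
    "Post":      {"Full": 0,   "Associate": 0,   "Assistant": 56,  "Post": 3},
    "Assistant": {"Full": 0,   "Associate": 36,  "Assistant": 167, "Post": 15},
    "Associate": {"Full": 29,  "Associate": 145, "Assistant": 31,  "Post": 1},
    "Full":      {"Full": 252, "Associate": 31,  "Assistant": 3,   "Post": 0},
}


def accuracy(matrix):
    """Share of cases whose predicted position equals the actual one."""
    correct = sum(matrix[position][position] for position in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


def recall(matrix, position):
    """Share of the cases of one actual position that were predicted correctly."""
    row = matrix[position]
    return row[position] / sum(row.values())


overall = accuracy(confusion)  # 567 correct out of 769 cases
```

The per-position recall shows where the errors concentrate: most actual "Post" cases are predicted as "Assistant".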
58. Task 1.2: Promoting Researchers
• Given the researcher's current position and
her bibliometric measures, decide if she
should be promoted.
• Measure the absolute deviation in years
from the actual promotion time.
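The evaluation measure above can be sketched as a mean absolute deviation between the recommended and the actual promotion year, both counted in years from PhD. The function name and the sample values are hypothetical.

```python
def mean_absolute_deviation(recommended_years, actual_years):
    """Average of |recommended - actual| promotion times, in years."""
    if len(recommended_years) != len(actual_years) or not actual_years:
        raise ValueError("need two non-empty lists of equal length")
    pairs = zip(recommended_years, actual_years)
    return sum(abs(r - a) for r, a in pairs) / len(actual_years)


# Hypothetical promotion times (years from PhD) for three researchers.
deviation = mean_absolute_deviation([6, 13, 7], [5, 13, 9])
```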
59. Promotion Decision Task - Results
Absolute deviation in years, by position
Measure                  Calculated as          Source  Self-citation level  Cited manuscript type  Citing manuscript type  Assistant  Associate  Full  Average
Rational H-Index 1       Absolute Value         Google  1                    All                    Journal                 1.26       1.58       1.88  1.51
Total Citations          Change from Last Rank  Google  0                    Journal                All                     1.26       1.68       1.88  1.55
Total Citations          Change from Last Rank  Google  2                    Journal                All                     1.26       1.68       1.88  1.55
Total Citations          Change from Last Rank  Google  1                    Journal                All                     1.26       1.71       1.88  1.56
Norm Individual H-Index  Change from Last Rank  Google                       All                    Journal                 1.28       1.74       1.79  1.56
…                        …                      …       …                    …                      …                       …          …          …     …
Individual H Index       Change from Last Rank  Google  1                    Journal                Journal                 1.30       2.03       2.38  1.80
Contemporary H Index     Absolute Value         Google  1                    Journal                All                     1.46       2.00       2.17  1.81
* No statistical significance has been found
* In about 2% of the cases, our system did not recommend promoting a researcher
although the promotion actually took place.
60. Not by bibliometrics alone
[Figure: Improvement vs. Rank; x-axis: 1. Assistant, 2. Associate, 3. Full; y-axis: improvement, from -30% to 25%]
Promoted to Associate: 6 years from PhD
Promoted to Full: 13 years from PhD
Measure             Assistant  Associate  Full  Average
Rational H-Index 1  1.26       1.58       1.88  1.51
Years from PhD      1.02       1.72       2.38  1.45
66. Conclusions – Take 1
• Seniority is a good indicator for
promoting scientists in leading US
universities.
• Variation in bibliometrics among
scientists contributes only slightly to the
promotion timing.
• No significant difference between ISI
and Google
• Self-citation is not so important
• After all, journals are more reliable
than other publications.
68. AAAI Fellowship
Try to determine if and when an AI scientist is
qualified to be elected to the AAAI Fellowship
Data set:
– 92 researchers who won the award between 1995 and
2009 only
– 200 randomly selected AI researchers with at least 5
papers in top-tier AI journals/conferences
– Using ISI data
• Google Scholar: coming soon
69. Task 2.1 – Leave One Scientist Out
Criterion Average Performance
Not Identifying a fellow (False Negative) 21%
Wrongly identifying a non-fellow (False Positive) 8.2%
70. Using a single measure
Fellows
H-Index
Criterion Average Performance
Not Identifying a fellow (False Negative) 48%
Wrongly identifying a non-fellow (False Positive) 6.1%
73. Rules Example
• (TC/A = '(65.7085-inf)') and (TP/A = '(26.084-inf)') and (Ih = '(3.565-inf)') and (CpY = '(13.191-inf)') => FellowWon=TRUE (49.0/5.0)
• (Pi = '(0.645-inf)') and (AWCR = '(1.0555-3.6035]') and (TC/A = '(80.875-inf)') => FellowWon=TRUE (29.0/3.0)
• (TP = '(7.5-inf)') and (e = '(6.595-inf)') and (TP = '(47.5-inf)') and (AWCR = '(1.0735-3.849]') and (AWCRpA = '(2.1705-inf)') and (SIh = '(0.5-3.5]') => FellowWon=TRUE (18.0/1.0)
• …
74. Task 2.3 – Social Network
• Based on the idea of the Erdős
number
• Predict fellowship based
on co-authorship with
other fellows.
• http://academic.research.microsoft.com/VisualExplorer.aspx#1802181&84132
75. Task 2.3
Criterion Average Performance
Not Identifying a fellow (False Negative) 52%
Wrongly identifying a non-fellow (False Positive) 6.6%
+
Criterion Average Performance
Not Identifying a fellow (False Negative) 21%
Wrongly identifying a non-fellow (False Positive) 8.2%
=
Criterion Average Performance
Not Identifying a fellow (False Negative) 16%
Wrongly identifying a non-fellow (False Positive) 5.9%
76. Task 2.3
• (Count >= 5) and (CpP >= 7) and (TP/A >=
6.883) => Fellow=TRUE (51.0/3.0)
• (TP/A >= 22.944) and (Avg <= 3.266667) and
(TP <= 40) => Fellow=TRUE (23.0/3.0)
• (Count >= 5) and (e >= 7.071) and (CpP <=
1.618) => Fellow=TRUE (11.0/1.0)
• …
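A sketch of how a rule list like the one above could be applied to a candidate: a candidate is predicted to be a fellow if all conditions of at least one rule hold. Only the three rules shown on the slide are encoded (the elided ones are not), the feature names are treated as opaque keys because the slide does not define them, and the candidate values are hypothetical.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le}

# The three rules listed on the slide, as (feature, comparison, threshold).
RULES = [
    [("Count", ">=", 5), ("CpP", ">=", 7), ("TP/A", ">=", 6.883)],
    [("TP/A", ">=", 22.944), ("Avg", "<=", 3.266667), ("TP", "<=", 40)],
    [("Count", ">=", 5), ("e", ">=", 7.071), ("CpP", "<=", 1.618)],
]


def rule_fires(rule, features):
    """A rule fires when every one of its conditions holds."""
    return all(OPS[op](features[name], threshold) for name, op, threshold in rule)


def predict_fellow(features):
    """Fellow=TRUE if any rule fires; otherwise FALSE."""
    return any(rule_fires(rule, features) for rule in RULES)


# Hypothetical candidates.
strong = {"Count": 6, "CpP": 9.0, "TP/A": 10.0, "Avg": 5.0, "TP": 80, "e": 3.0}
weak = {"Count": 1, "CpP": 2.0, "TP/A": 3.0, "Avg": 5.0, "TP": 80, "e": 3.0}
```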
77. Conclusions – Take 2
• Bibliometric measures can be used to
predict fellowship
• Combining various measures using data
mining techniques improves prediction power
• Co-authorship relations can slightly boost
the accuracy
79. Very Near Future Work
• Adding the Google Scholar dataset
• Examining the contribution of conferences in
predicting the fellowship.
• Tell Me Who Cites You, …
80. Why God Never Received
Tenure at Any University
1) He had only one major publication.
2) It was in Hebrew.
3) It had no references.
4) It wasn't published in a refereed journal.
5) Some even doubt he wrote it himself.
6) It may be true that he created the world, but what has he done since then?
7) His cooperative efforts have been quite limited.
8) The scientific community has had a hard time replicating his results.
9) He never applied to the Ethics Board for permission to use human subjects.
10) When an experiment went awry, he tried to cover it up by drowning the subjects.
11) When subjects didn't behave as predicted, he deleted them from the sample.
12) He rarely came to class, just told students to read the book.
13) Some say he had his son teach the class.
14) He expelled his first two students for learning.
15) Although there were only ten requirements, most students failed his tests.
16) His office hours were infrequent and usually held on a mountaintop.
81. References
• Johan Bollen, Marko A. Rodriguez, Herbert Van de Sompel, Journal status, Scientometrics, Vol. 69, No. 3 (2006) 669-687
• J. A. Christenson, L. Sigelman, Accrediting knowledge: Journal stature and citation impact in social science, Soc. Sci. Quart. 66:964-75, 1985
• A. F. J. van Raan (2006), Performance-related differences of bibliometric statistical properties of research groups: cumulative advantages and hierarchically layered networks, Journal of the American Society for Information Science and Technology, 57 (14): 1919–1935
• D. Epstein (2007), Impact factor manipulation, The Write Stuff, 16: 133–134
• Antonia Andrade, Raúl González-Jonte, Juan Miguel Campanario, Journals that increase their impact factor at least fourfold in a few years: The role of journal self-citations, Scientometrics, Vol. 80, No. 2 (2009) 517-530
• Peter Vinkler, The pi-index: a new indicator for assessing scientific impact, Journal of Information Science, Vol. 35, No. 5 (2009) 602-612
• Peter Vinkler, An attempt for defining some basic categories of scientometrics and classifying the indicators of evaluative scientometrics, Scientometrics, Vol. 50, No. 3 (2001) 539-544
• Peter Jacso, Testing the Calculation of a Realistic h-index in Google Scholar, Scopus, and Web of Science for F. W. Lancaster, Library Trends, Vol. 56, No. 4, Spring 2008, pp. 784-815
• R. K. Merton, “The Matthew Effect in Science,” Science, vol. 159, no. 3810, pp. 56–63, January 1968
• J. Beel and B. Gipp, “The Potential of Collaborative Document Evaluation for Science,” in 11th International Conference on Asian Digital Libraries (ICADL'08), ser. Lecture Notes in Computer Science (LNCS), G. Buchanan, M. Masoodian, and S. J. Cunningham, Eds., vol. 5362. Heidelberg (Germany): Springer, December 2008, pp. 375–378
• J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, ArnetMiner: Extraction and mining of academic social networks, Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 990-998, 2008, ACM
• B. H. Weinberg, The Earliest Hebrew Citation Indexes, Journal of the American Society for Information Science, 48(4):318–330, 1997
• Richard Van Noorden (2010), A profusion of measures, Nature, Vol. 465
• Leo Egghe, Raf Guns, Ronald Rousseau (2011), Thoughts on Uncitedness: Nobel Laureates and Fields Medalists as Case Studies
• M. H. MacRoberts and B. R. MacRoberts (2011), Problems of Citation Analysis: A Study of Uncited and Seldom-Cited Influences