Reliability and Comparability of Peer Review Results

Universiteit
Antwerpen
Conference "New Frontiers in Evaluation", Vienna, April 24th-25th 2006.
Nadine Rons, Coordinator of Research Evaluations & Policy Studies
Research & Development Department, Vrije Universiteit Brussel
Eric Spruyt, Head of the Research Administration Department
Universiteit Antwerpen

“Three cheers for peers”
  ‘Three cheers for peers’, Editorial, Nature 439, 118 (12 January
2006).
•  "Thanks are due to researchers who act as
referees, as editors resolve their often contradictory
advice."
•  "Only in a minority of cases does every referee
agree ..."

Presentation plan
I.  Validation of results
Reliability & comparability
II.  Material investigated
'Ex post' peer review + citation analysis of teams
III.  Investigation of results
Reliability: inter-peer agreement & different rating habits
Comparability: related concepts & intrinsic characteristics
IV.  Conclusions
Aimed at improved results, a better understanding, choosing the
right method

I. Validation of results
1. Reliability
Peer review: principal method to evaluate research quality.
BUT: various kinds of bias & different rating habits.
& it is not always feasible to use measures limiting their influence.
⇒  Is it possible to measure reliability?
2. Comparability
  H F Moed (2005), 'Citation Analysis in Research Evaluation', chapter
18: 'Peer Review and the Validity of Citation Analysis', Springer.
More reliable results ⇒ better correlations with other outcomes?
Correlations often relatively weak & depending on the discipline.
⇒  Can this be explained? (crucial for further acceptance!)

II. Material investigated
1. Peer review
–  Shared principles for the panel-evaluations of teams per discipline:
•  Expertise-based
•  International level
•  Uniform treatment
•  Coherence of results
•  Multi-criteria approach
•  Pertinent advice
–  Exceptions:
•  Different experts for each team (1 discipline at VUB).
•  Specific methodology using different indicators (1 discipline at UA).

(Peer review @ VUB)
–  VUB-indicators:
  Standard procedure 'VUB-Richtstramien voor de Disciplinegewijze
Onderzoeksevaluaties', VUB Research Council (2001).
•  Scientific merit of the research / uniqueness of the research
•  Research approach / plan / focus / coordination
•  Innovation
•  Quality of the research team
•  Probability that the research objectives will be achieved
•  Research productivity
•  Potential impact on further research and on the development of applications
•  Potential for transition to or utility for the community
•  Dominant character of the research (fundamental / applied / policy oriented)
•  Overall research evaluation

(Peer review @ UA)
–  UA-indicators:
  'Protocol 1998' for the Assessment of Research Quality, Association of Universities of
the Netherlands (VSNU, 1998).
•  Academic quality
•  Academic productivity
•  Scientific relevance
•  Academic perspective
Exception (1 discipline, "partial" indicators):
•  Publications
•  Projects
•  Conference participations
•  Other
•  Globally

2. Citation analysis
  'New Bibliometric Tools for the Assessment of National Research Performance:
Database Description, Overview of Indicators and First Applications', H F Moed et al.,
Scientometrics 33 (1995).
–  Centre for Science and Technology Studies (CWTS), Leiden University.
–  Thomson ISI citation indexes, corresponding period, same teams.
–  Indicators include:
•  CPP/JCSm: citations / publication with respect to expectations for the
journals
•  CPP/FCSm: citations / publication with respect to expectations for the field
•  JCSm/FCSm: journal citation score with respect to expectations for the field
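A minimal sketch of how the three ratios above relate to each other, assuming per-paper data are available for one team. The `Paper` fields and the sample numbers are hypothetical; how CWTS derives the journal and field expectations from the citation indexes is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    citations: int           # citations received by the paper
    journal_expected: float  # assumed: mean citation rate of its journal
    field_expected: float    # assumed: mean citation rate of its field


def citation_indicators(papers):
    """Return the three normalised ratios for one team's publication set."""
    n = len(papers)
    cpp = sum(p.citations for p in papers) / n          # citations per publication
    jcsm = sum(p.journal_expected for p in papers) / n  # mean journal citation score
    fcsm = sum(p.field_expected for p in papers) / n    # mean field citation score
    return {
        "CPP/JCSm": cpp / jcsm,
        "CPP/FCSm": cpp / fcsm,
        "JCSm/FCSm": jcsm / fcsm,
    }


# Hypothetical team with two papers: CPP = 6, JCSm = 4, FCSm = 4.
team = [Paper(10, 5.0, 4.0), Paper(2, 3.0, 4.0)]
indicators = citation_indicators(team)
print(indicators)
```

A value above 1 means the team is cited more than expected for its journals or field; a value below 1 means less.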

III. Investigation of results
(Overview)
1. Reliability
a. Inter-peer agreement:
Three groups of evaluations according to measured level of agreement.
b. Rating habits:
Panel-procedures vs. exception with different experts for each team.
⇒  Influence on results & on correlations between peer review indicators
investigated.
2. Comparability
a. Related concepts:
'Global' vs. 'partial' indicators & variation with discipline.
b. Intrinsic characteristics of methods:
Contributions to ratings counted differently & scale effects.
⇒  Influence on comparability investigated.

1.  Reliability
1. a. Inter-peer agreement
In panels: different opinions ⇒ different positions of teams.
⇒  Level of inter-peer agreement measured by correlations
between the ratings from different peers.
⇒  3 groups compared: panels with high, intermediate and low
inter-peer agreement.
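One way to express the level of inter-peer agreement as a single number is the mean of all pairwise correlations between peers' ratings of the same teams. The sketch below assumes Pearson correlation and hypothetical scores; the slides do not state which correlation coefficient or aggregation was actually used.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long score lists.

    Raises ZeroDivisionError if one peer gave every team the same score.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def inter_peer_agreement(ratings):
    """Mean pairwise correlation between peers.

    ratings: dict mapping a peer to a list of scores, one per team,
    with the teams in the same order for every peer.
    """
    pairs = list(combinations(ratings.values(), 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Hypothetical panel: peers A and B rank the teams identically, C reverses them.
panel = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1]}
print(inter_peer_agreement(panel))
```

Panels could then be sorted by this value and split into high, intermediate and low agreement groups; the cut-off points are a separate choice not covered by this sketch.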

(1. Reliability, a. Inter-peer agreement)
–  Influence on results:
Results compared to citation analysis:
⇒  Better inter-peer agreement = higher number of significant
correlations,
BUT: only at the higher aggregation level of the 3 groups.
⇒  Other mechanisms have a stronger impact on correlations.
–  Influence on correlations between peer review indicators:
Significant correlations for each pair of peer review indicators, for
each of the 3 groups (also for individual disciplines).
⇒  Correlations between peer review indicators are relatively robust
to variations in inter-peer agreement.

1.b. Rating habits
Opinions → ratings: each peer converts opinions into scores according
to own habits, reference levels in other evaluations, scores given to
other files, known use of scores, ...
Two cases compared:
•  Exception with different experts for each team ⇒ scores not
necessarily in line with opinions.
•  Standard panel-evaluations ⇒ uniform reference level.

(1. Reliability, b. Rating habits)
–  Influence on results:
Results compared to citation analysis:
•  Panel-evaluations: significant correlations for all peer review
indicators with some or all citation analysis indicators (& vice versa).
•  Different experts: significant correlation for only 1 pair of indicators.
⇒  Rating habits can influence results significantly.
–  Influence on correlations between peer review indicators:
•  Panel-evaluations: significant correlations for all pairs of indicators.
•  Different experts: significant correlations for only 8% of the pairs.
⇒  Low observed correlations between indicators (expected to be
correlated) can indicate diverging rating habits.
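The check suggested above, counting how many pairs of peer review indicators correlate significantly, can be sketched as follows. This is only an illustration under stated assumptions: Pearson correlation, a two-sided p-value approximated with the Fisher z-transform, a 5% threshold, and hypothetical scores. The slides do not specify the test that was used.

```python
from itertools import combinations
from math import atanh, sqrt
from statistics import NormalDist


def pearson(x, y):
    """Pearson correlation of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def p_value(x, y):
    """Approximate two-sided p-value via the Fisher z-transform (needs n > 3)."""
    r = pearson(x, y)
    if abs(r) >= 1.0:  # perfect correlation: atanh is undefined at +/-1
        return 0.0
    z = atanh(r) * sqrt(len(x) - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def share_of_significant_pairs(scores, alpha=0.05):
    """Fraction of indicator pairs whose correlation is significant.

    scores: dict mapping an indicator name to a list of scores,
    one per team, with the teams in the same order for every indicator.
    """
    pairs = list(combinations(scores, 2))
    hits = sum(p_value(scores[a], scores[b]) < alpha for a, b in pairs)
    return hits / len(pairs)


# Hypothetical scores for eight teams on three indicators.
scores = {
    "merit": [1, 2, 3, 4, 5, 6, 7, 8],
    "productivity": [1, 2, 3, 4, 5, 6, 8, 7],
    "impact": [4, 1, 6, 3, 8, 5, 2, 7],
}
print(share_of_significant_pairs(scores))
```

A low share for indicators that are expected to move together would be a reason to look more closely at how individual experts assigned their scores.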

2. Comparability
2.a. Related concepts
–  Partial indicators (publications, projects, conferences, ...): no significant
correlations between peer review indicators, in contrast to global
indicators (scientific merit, productivity, relevance, ...).
⇒ Performances in different activities are not necessarily correlated.
–  Correlations of peer review with citation analysis indicators: the
best-correlating pairs vary strongly with discipline.
⇒ An indicator may not represent the same concept for all subject areas.
⇒ Always use more than one indicator!

2.b. Intrinsic characteristics
–  Contributions to ratings:
Different in the minds of peers (pro & contra) and in citation analysis
(positive counts).
–  Scale effects:
Minimum & maximum limits & their position with respect to the mean
value.

•  Peer rating frequency distribution:
–  Peer ratings: pro & contra, also elements counted 'negatively'.
–  Scale: minimum & maximum limit.
[Figure: Relative frequency distribution of peer results. Horizontal axis: peer results on a 10-point scale, LOW (1-2), FAIR (3-4), AVERAGE (5-6), GOOD (7-8), HIGH (9-10). Vertical axis: percentage of the number of teams (58), 0% to 50%. Series: Scientific merit of the research / uniqueness of the research; Research approach / plan / focus / co-ordination; Innovation; Quality of the research team; Probability that the research objectives will be achieved; Research productivity; Potential impact on further research and on the development of applications; Potential for transition to or utility for the community; Overall research evaluation.]

•  Citation impact frequency distribution:
–  Citation impact: only positive counts, strong influence of highly cited articles.
–  Scale: minimum limit closer to mean & no maximum limit.
[Figure: Relative frequency distribution of citation impact, all teams in the pure ISI analysis. Horizontal axis: indicator value, 0.1 to 3.1. Vertical axis: percentage of the number of teams (60), 0% to 40%. Series: CPP/JCSm, CPP/FCSm.]

⇒ Good correlations only when effects of intrinsic characteristics can be filtered out.
[Figure: Scientific relevance vs. field citation impact, high & intermediate inter-peer agreement group.]

IV. Conclusions
•  Reliability
–  Peer review results can be influenced considerably by rating habits.
–  It is recommended to create a uniform reference level (e.g. using
panel procedures) or check for signs of low reliability by analysing the
outcomes of the peer evaluation itself.
•  Comparability
–  Besides reliability, comparability of results depends on the nature of
the indicators, on the subject area, on intrinsic characteristics of the
methods, ...
–  Different methods describe different aspects. The most suitable
method should be carefully chosen or developed.
•  Evaluations should always be based on a series of indicators,
never on one single indicator.
