How Much Can We Generalize? Measuring the External
Validity of Impact Evaluations
Eva Vivalt∗
New York University
August 31, 2015
Abstract
Impact evaluations aim to predict the future, but they are rooted in particular
contexts and to what extent they generalize is an open and important question.
I founded an organization to systematically collect and synthesize impact evalu-
ation results on a wide variety of interventions in development. These data allow
me to answer this and other questions for the first time using a large data set
of studies. I consider several measures of generalizability, discuss the strengths
and limitations of each metric, and provide benchmarks based on the data. I
use the example of the effect of conditional cash transfers on enrollment rates to
show how some of the heterogeneity can be modelled and the effect this can have
on the generalizability measures. The predictive power of the model improves
over time as more studies are completed. Finally, I show how researchers can
estimate the generalizability of their own study using their own data, even when
data from no comparable studies exist.
∗E-mail: eva.vivalt@nyu.edu. I thank Edward Miguel, Bill Easterly, David Card, Ernesto Dal Bó, Hunt Allcott, Elizabeth Tipton, David McKenzie, Vinci Chow, Willa Friedman, Xing Huang, Michaela Pagel, Steven Pennings, Edson Severnini, seminar participants at the University of California, Berkeley, Columbia University, New York University, the World Bank, Cornell University, Princeton University, the University of Toronto, the London School of Economics, the Australian National University, and the University of Ottawa, among others, and participants at the 2015 ASSA meeting and 2013 Association for Public Policy Analysis and Management Fall Research Conference for helpful comments. I am also grateful for the hard work put in by many at AidGrade over the duration of this project, including but not limited to Jeff Qiu, Bobbie Macdonald, Diana Stanescu, Cesar Augusto Lopez, Mi Shen, Ning Zhang, Jennifer Ambrose, Naomi Crowther, Timothy Catlett, Joohee Kim, Gautam Bastian, Christine Shen, Taha Jalil, Risa Santoso and Catherine Razeto.
1 Introduction
In the last few years, impact evaluations have become extensively used in development economics research. Policymakers and donors typically fund impact evaluations precisely to learn how effective a similar program would be in the future, so as to guide their decisions about what course of action to take. However, it is not yet clear how much we can extrapolate from past results or under which conditions. Further, there is some evidence that even a similar program, in a similar environment, can yield different results. For example, Bold et al. (2013) carry out an impact evaluation of a program to provide contract teachers in Kenya; this was a scaled-up version of an earlier program studied by Duflo, Dupas and Kremer (2012). The earlier intervention studied by Duflo, Dupas and Kremer was implemented by an NGO, while Bold et al. compared implementation by an NGO and the government. While Duflo, Dupas and Kremer found positive effects, Bold et al. found significant results only for the NGO-implemented group. The different findings in the same country for purportedly similar programs point to the substantial context-dependence of impact evaluation results. Understanding this context-dependence is crucial to knowing what we can learn from any impact evaluation.
While the main reason to examine generalizability is to aid interpretation and improve predictions, it would also help direct research attention to where it is most needed. If generalizability were higher in some areas, fewer papers would be needed to understand how people would behave in a similar situation; conversely, if there were topics or regions where generalizability was low, that would call for further study. With more information, researchers can better calibrate where to direct their attention to generate new insights.
It is well-known that impact evaluations only happen in certain contexts. For example, Figure 1 shows a heat map of the geocoded impact evaluations in the data used in this paper overlaid by the distribution of World Bank projects (black dots). Both sets of data are geographically clustered, and whether or not we can reasonably extrapolate from one to another depends on how much related heterogeneity there is in treatment effects. Allcott (forthcoming) recently showed that site selection bias was an issue for randomized controlled trials (RCTs) on a firm's energy conservation programs. Microfinance institutions that run RCTs and hospitals that conduct clinical trials are also selected (Allcott, forthcoming), and World Bank projects that receive an impact evaluation are different from those that do not (Vivalt, 2015). Others have sought to explain heterogeneous treatment effects in meta-analyses of specific topics (e.g. Saavedra and Garcia, 2013, among many others for conditional cash transfers), or to argue they are so heterogeneous they cannot be adequately modelled (e.g. Deaton, 2011; Pritchett and Sandefur, 2013).
Figure 1: Growth of Impact Evaluations and Location Relative to Programs
The figure on the left shows a heat map of the impact evaluations in AidGrade’s database overlaid by black
dots indicating where the World Bank has done projects. While there are many other development
programs not done by the World Bank, this figure illustrates the great numbers and geographical
dispersion of development programs. The figure on the right plots the number of studies that came out in
each year that are contained in each of three databases described in the text: 3ie’s title/abstract/keyword
database of impact evaluations; J-PAL’s database of affiliated randomized controlled trials; and AidGrade’s
database of impact evaluation results data.
Impact evaluations are still increasing exponentially, both in number and in the resources devoted to them. The World Bank recently received a major grant from the UK aid agency DFID to expand its already large impact evaluation work; the Millennium Challenge Corporation has committed to conduct rigorous impact evaluations for 50% of its activities, with "some form of credible evaluation of impact" for every activity (Millennium Challenge Corporation, 2009); and the U.S. Agency for International Development is also increasingly invested in impact evaluations, coming out with a new policy in 2011 that directs 3% of program funds to evaluation.1
Yet while impact evaluations are still growing in development, a few thousand are already complete. Figure 1 plots the explosion of RCTs that researchers affiliated with J-PAL, a center for development economics research, have completed each year; alongside are the number of development-related impact evaluations released that year according to 3ie, which keeps a directory of titles, abstracts, and other basic information on impact evaluations more broadly, including quasi-experimental designs; finally, the dashed line shows the number of papers that came out in each year that are included in AidGrade's database of impact evaluation results, which will be described shortly.
1While most of these are less rigorous "performance evaluations", country mission leaders are supposed to identify at least one opportunity for impact evaluation for each development objective in their 3-5 year plans (USAID, 2011).
In short, while we do impact evaluation to figure out what will happen in the future, many issues have been raised about how well we can extrapolate from past impact evaluations. Despite the importance of the topic, we were previously able to do little more than guess or examine the question in narrow settings, as we did not have the data. Now we have the opportunity to address this speculation, drawing on a large, unique dataset of impact evaluation results.
I founded a non-profit organization dedicated to gathering these data. That organization, AidGrade, seeks to systematically understand which programs work best where, a task that also requires knowing the limits of our knowledge. To date, AidGrade has conducted 20 meta-analyses and systematic reviews of different development programs.2 Data gathered through meta-analyses are the ideal data with which to answer the question of how much we can extrapolate from past results, and since data on these 20 topics were collected in the same way, coding the same outcomes and other variables, we can look across different types of programs to see if there are any more general trends. Currently, the data set contains 647 papers on 210 narrowly-defined intervention-outcome combinations, with the greater database containing 15,021 estimates.
I define generalizability and discuss several metrics with which to measure it. Other disciplines have considered generalizability more extensively, so I draw on the literature relating to meta-analysis, which is most developed in medicine, as well as the psychometric literature on generalizability theory (Higgins and Thompson, 2002; Shavelson and Webb, 2006; Briggs and Wilson, 2007). The measures I discuss could also be used in conjunction with any model that seeks to explain variation in treatment effects (e.g. Dehejia, Pop-Eleches and Samii, 2015) to quantify the proportion of variation that such a model explains. Since some of the analyses will draw upon statistical methods not commonly used in economics, I will use the concrete example of conditional cash transfers (CCTs), which are relatively well-understood and on which many papers have been written, to elucidate the issues.
While this paper focuses on results for impact evaluations of development programs, this
is only one of the first areas within economics to which these kinds of methods can be applied.
In many of the sciences, knowledge is built through a combination of researchers conducting
individual studies and other researchers synthesizing the evidence through meta-analysis.
This paper begins that natural next step.
2Throughout, I will refer to all 20 as meta-analyses, but some did not have enough comparable outcomes for meta-analysis and became systematic reviews.
2 Theory
2.1 Heterogeneous Treatment Effects
I model treatment effects as potentially depending on the context of the intervention. Each impact evaluation is on a particular intervention and covers a number of outcomes. The relationship between an outcome, the inputs that were part of the intervention, and the context of the study is complex. In the simplest model, we can imagine that context can be represented by a "contextual variable", C, such that:

Z_j = \alpha + \beta T_j + \delta C_j + \gamma T_j C_j + \varepsilon_j \qquad (1)

where j indexes the individual, Z represents the value of an aggregate outcome such as "enrollment rates", T indicates being treated, and C represents a contextual variable, such as the type of agency that implemented the program.3
In this framework, a particular impact evaluation might explicitly estimate:

Z_j = \alpha + \beta' T_j + \varepsilon_j \qquad (2)

but, as Equation 1 can be re-written as Z_j = \alpha + (\beta + \gamma C_j) T_j + \delta C_j + \varepsilon_j, what \beta' is really capturing is the effect \beta' = \beta + \gamma C. When C varies, unobserved, in different contexts, the variance of \beta' increases.
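To see the implication concretely, the following is a minimal simulation sketch (mine, not the paper's; all parameter values are illustrative): each "study" generates data according to Equation 1 in a context with a different, unobserved draw of C, estimates Equation 2, and the dispersion of the estimated β′ grows with the dispersion of C.

```python
import numpy as np

rng = np.random.default_rng(0)

def beta_prime_hat(C, beta=0.1, gamma=0.3, delta=0.2, n=2000):
    """One hypothetical 'study' in a context with contextual variable C:
    the data follow Equation 1, but we estimate Equation 2, omitting C,
    so the estimated treatment effect is beta' = beta + gamma * C
    (plus sampling noise)."""
    T = rng.integers(0, 2, n)                        # random assignment
    Z = 0.5 + (beta + gamma * C) * T + delta * C + rng.normal(0, 1, n)
    return np.cov(T, Z, ddof=0)[0, 1] / np.var(T)    # OLS slope of Z on T

# Sets of contexts that differ more in the unobserved C produce more
# dispersed estimates of beta':
for spread in (0.0, 0.5, 1.0):
    estimates = [beta_prime_hat(rng.normal(0, spread)) for _ in range(200)]
    print(f"sd(C) = {spread}: var(beta'_hat) = {np.var(estimates):.4f}")
```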
This is the simplest case. One can imagine that the true state of the world has "interaction effects all the way down".

Interaction terms are often considered a second-order problem. However, that intuition could stem from the fact that we usually look for interaction terms within an already fairly homogeneous dataset - e.g. data from a single country, at a single point in time, on a particularly selected sample.
Not all aspects of context need matter to an intervention's outcomes. The set of contextual variables can be divided into a critical set on which outcomes depend and a set on which they do not; I will ignore the latter. Further, the relationship between Z and C can vary by intervention or outcome. For example, school meals programs might have more of an effect on younger children, but scholarship programs could plausibly affect older children more. If one were to regress effect size on the contextual variable "age", we would get different results depending on which intervention and outcome we were considering. Therefore, it will be important in this paper to look only at a restricted set of contextual variables which could plausibly work in a similar way across different interventions. Additional analysis could profitably be done within some interventions, but this is outside the scope of this paper.

3Z can equally well be thought of as the average individual outcome for an intervention. Throughout, I take high values for an outcome to represent a beneficial change unless otherwise noted; if an outcome represents a negative characteristic, like incidence of a disease, its sign will be flipped before analysis.
Generalizability will ultimately depend on the heterogeneity of treatment effects. The
next section formally defines generalizability for use in this paper.
2.2 Generalizability: Definitions and Measurement
Definition 1 Generalizability is the ability to predict results accurately out of sample.
Definition 2 Local generalizability is the ability to predict results accurately in a particular
out-of-sample group.
There are several ways to operationalize these definitions. The ability to predict results hinges on both the variability of the results and the proportion of that variability that can be explained. For example, high overall variability in a set of results might not be as concerning if the proportion of the variability that can be explained is also high.
It is straightforward to measure the variance in results. However, these statistics need to be benchmarked in order to know what counts as a "high" or "low" variance. One advantage of the large data set used in this paper is that I can use it to benchmark the results from different intervention-outcome combinations against each other. This is not the first paper to tentatively suggest a scale. Other rules of thumb have also been created in this manner, such as those used to consider the magnitude of effect sizes (0-0.2 SD = "small", 0.2-0.5 SD = "medium", > 0.5 SD = "large") (Cohen, 1988) or the measure of the impact of heterogeneity on meta-analysis results, I² (0.25 = "low", 0.5 = "medium", 0.75 = "high") (Higgins et al., 2003). I can also compare across-paper variation to within-paper variation, with the idea that within-study variation should represent a lower bound on across-study variation within the same intervention-outcome combination. Further, I can create variance benchmarks based on back-of-the-envelope calculations for what the variance would imply for predictive power under a set of assumptions. This will be discussed in more detail later.
One potential drawback to considering the variance of studies' results is that studies with higher effect sizes, or results measured in units with larger scales, may mechanically have larger variances. This would limit us to making comparisons only between data with the same scale. We could either: 1) restrict attention to those outcomes in the same natural units (e.g. enrollment rates in percentage points); 2) convert results to be in terms of a common unit, such as standard deviations4; or 3) scale the standard deviation by the mean result, creating the coefficient of variation. The coefficient of variation represents the inverse of the signal-to-noise ratio, and as a unitless figure it can be compared across intervention-outcome combinations with different natural units. It is not immune to criticism, however, particularly in that it may take large values as the mean approaches zero.5
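As a quick illustration of this unitless property (a sketch with hypothetical numbers, not data from the paper):

```python
import numpy as np

def coef_of_variation(effects):
    """Coefficient of variation of a set of results; the paper's convention
    is to report its absolute value."""
    effects = np.asarray(effects, dtype=float)
    return abs(np.std(effects, ddof=1) / np.mean(effects))

# Rescaling all results (e.g. percentage points -> proportions) changes the
# variance by the square of the scale factor but leaves the CV unchanged.
y = np.array([0.10, 0.15, 0.25, 0.30])
print(np.var(y, ddof=1), np.var(100 * y, ddof=1))        # differ by 100^2
print(coef_of_variation(y), coef_of_variation(100 * y))  # identical
```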
All the measures discussed so far focus on variation. However, if we could explain the
variation, it would no longer worsen our ability to make predictions in a new setting, so
long as we had all the necessary data from that setting, such as covariates, with which to
extrapolate.
To explain variation, we need a model. The meta-analysis literature suggests two general types of models, which can be parameterized in many ways: fixed-effect models and random-effects models.

Fixed-effect models assume there is one true effect of a particular program and that all differences between studies can be attributed simply to sampling error. In other words:

Y_i = \theta + \varepsilon_i \qquad (3)

where Y_i is the observed effect size of a particular study, \theta is the true effect and \varepsilon_i is the error term.

Random-effects models do not make this assumption; the true effect could potentially vary from context to context. Here,

Y_i = \theta_i + \varepsilon_i \qquad (4)
    = \bar{\theta} + \eta_i + \varepsilon_i \qquad (5)

where \theta_i is the effect size for a particular study i, \bar{\theta} is the mean true effect size, \eta_i is a particular study's divergence from that mean true effect size, and \varepsilon_i is the error. Random-effects models are more plausible, and they are necessary if we think there are heterogeneous treatment effects, so I use them in this paper. Random-effects models can also be modified by the addition of explanatory variables, at which point they are called mixed models; I will also use mixed models in this paper.
Sampling variance, var(Y_i | \theta_i), is denoted \sigma^2, and between-study variance, var(\theta_i), is denoted \tau^2.
4This can be problematic if the standard deviations themselves vary, but it is a common approach in the meta-analysis literature in lieu of a better option.
5This paper follows convention and reports the absolute value of the coefficient of variation wherever it appears.
This variation in observed effect sizes is then:

var(Y_i) = \tau^2 + \sigma^2 \qquad (6)

and the proportion of the variation that is not sampling error is:

I^2 = \frac{\tau^2}{\tau^2 + \sigma^2} \qquad (7)
The I² is an established metric in the meta-analysis literature that helps determine whether a fixed or random effects model is more appropriate; the higher the I², the less plausible it is that sampling error drives all the variation in results. I² is considered "low" at 0.25, "medium" at 0.5, and "high" at 0.75 (Higgins et al., 2003).6
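For intuition, here is a minimal sketch of how \tau^2, \sigma^2 and I² could be estimated from a set of effect sizes and standard errors using the common DerSimonian-Laird moment estimator. This is an illustrative stand-in: the paper itself estimates these parameters with the hierarchical Bayesian model of Section 2.3, and taking the "typical" within-study variance as \sigma^2 is one of several conventions.

```python
import numpy as np

def dersimonian_laird(y, se):
    """Moment-based estimates of tau^2, sigma^2 and I^2 from effect sizes y
    and standard errors se; a frequentist alternative to the paper's
    hierarchical Bayesian estimation."""
    y, se = np.asarray(y, float), np.asarray(se, float)
    w = 1.0 / se**2                       # fixed-effect weights
    y_bar = np.sum(w * y) / np.sum(w)
    Q = np.sum(w * (y - y_bar)**2)        # Cochran's Q
    df = len(y) - 1
    C = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (Q - df) / C)         # between-study variance
    sigma2 = np.mean(se**2)               # "typical" within-study variance
    i2 = tau2 / (tau2 + sigma2)           # Equation (7)
    return tau2, sigma2, i2

# Hypothetical effect sizes (in SD units) and their standard errors:
print(dersimonian_laird([0.10, 0.25, 0.02, 0.40], [0.05, 0.08, 0.04, 0.10]))
```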
If we wanted to explain more of the variation, we could do moderator or mediator analysis, in which we examine how results vary with the characteristics of the study, characteristics of its sample, or details about the intervention and its implementation. A linear meta-regression is one way of accomplishing this goal, explicitly estimating:

Y_i = \beta_0 + \sum_n \beta_n X_n + \eta_i + \varepsilon_i

where the X_n are explanatory variables. This is a mixed model and, upon estimating it, we can calculate several additional statistics: the amount of residual variation in Y_i after accounting for the X_n, var_R(Y_i - \hat{Y}_i); the coefficient of residual variation, CV_R(Y_i - \hat{Y}_i); and the residual I², I²_R. Further, we can examine the R² of the meta-regression.
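A minimal frequentist sketch of such a meta-regression (again an illustrative stand-in rather than the paper's estimation strategy, which is the hierarchical Bayesian approach below; scaling the residual CV by the mean result is my assumption):

```python
import numpy as np
import statsmodels.api as sm

def meta_regression(y, se, X, tau2):
    """Weighted least squares meta-regression of effect sizes y on
    moderators X, weighting each study by 1 / (se_i^2 + tau2). tau2 could
    come from a moment estimator such as the DerSimonian-Laird sketch
    above."""
    y, se = np.asarray(y, float), np.asarray(se, float)
    exog = sm.add_constant(np.asarray(X, float))
    fit = sm.WLS(y, exog, weights=1.0 / (se**2 + tau2)).fit()

    resid = y - fit.fittedvalues
    var_r = np.var(resid, ddof=1)                  # var_R(Y_i - Y_hat_i)
    cv_r = abs(np.std(resid, ddof=1) / y.mean())   # CV_R, scaled by the
                                                   # mean result (my choice)
    return fit, var_r, cv_r
```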
It should be noted that a linear meta-regression is only one way of modelling variation in Y_i. The I², for example, is analogous to the reliability coefficient of classical test theory or the generalizability coefficient of generalizability theory (a branch of psychometrics), both of which estimate the proportion of variation that is not error. In this literature, additional heterogeneity is usually modelled using ANOVA rather than meta-regression. Modelling variation in treatment effects also does not have to occur only retrospectively at the conclusion of studies; we can imagine that a carefully-designed study could anticipate and estimate some of the potential sources of variation experimentally.
Table 1 summarizes the different indicators, dividing them into measures of variation and
measures of the proportion of variation that is systematic.
Each of these metrics has its advantages and disadvantages.
6The Cochrane Collaboration uses a slightly different set of norms, saying 0-0.4 "might not be important", 0.3-0.6 "may represent moderate heterogeneity", 0.5-0.9 "may represent substantial heterogeneity", and 0.75-1 "considerable heterogeneity" (Higgins and Green, 2011).
Table 1: Summary of heterogeneity measures

Measures of variation: var(Y_i); var_R(Y_i - Ŷ_i); CV(Y_i); CV_R(Y_i - Ŷ_i).
Measures of the proportion of variation that is systematic: I²; I²_R; R².
Measures making use of explanatory variables: var_R(Y_i - Ŷ_i); CV_R(Y_i - Ŷ_i); I²_R; R².
Table 2: Desirable properties of a measure of heterogeneity

Properties considered: does not depend on the number of studies in a cell; does not depend on the precision of individual estimates; does not depend on the estimates' units; does not depend on the mean result in the cell.

Measures compared: var(Y_i); var_R(Y_i - Ŷ_i); CV(Y_i); CV_R(Y_i - Ŷ_i); I²; I²_R; R².

A "cell" here refers to an intervention-outcome combination. The "precision" of an estimate refers to its standard error.
Table 2 summarizes the desirable properties of a measure of heterogeneity and which properties are possessed by each of the discussed indicators. Measuring heterogeneity using the variance of Y_i requires the Y_i to have comparable units. Using the coefficient of variation requires the assumption that the mean effect size is an appropriate measure with which to scale sd(Y_i). The variance and coefficient of variation also do not have anything to say about the amount of heterogeneity that can be explained. Adding explanatory variables also has its limitations: in any model, we have no way to guarantee that we are indeed capturing all the relevant factors. While I² has the nice property that it disaggregates sampling variance as a source of variation, estimating it depends on the weights applied to each study's results and thus, in turn, on the sample sizes of the studies. The R² has its own well-known caveats, such as that it can be artificially inflated by over-fitting.
Having discussed the different measures of generalizability I will use in this paper, I now turn to describing how I estimate the parameters of the random effects and mixed models.
2.3 Hierarchical Bayesian Analysis
This paper uses meta-analysis as a tool to synthesize evidence.
As a quick review, there are many steps in a meta-analysis, most of which have to do with the selection of the constituent papers. The search and screening of papers will be described in the data section; here, I merely discuss the theory behind how meta-analyses combine results and estimate the parameters \sigma^2 and \tau^2 that will be used to generate I².
I begin by presenting the random effects model, followed by the related strategy to
estimate a mixed model.
2.4 Estimating a Random Effects Model
To build a hierarchical Bayesian random effects model, I first assume the data are normally distributed:

Y_{ij} | \theta_i \sim N(\theta_i, \sigma^2) \qquad (8)

where j indexes the individuals in the study. I do not have individual-level data, but can instead use sufficient statistics:

Y_i | \theta_i \sim N(\theta_i, \sigma_i^2) \qquad (9)

where Y_i is the sample mean and \sigma_i^2 the sample variance. This provides the likelihood for \theta_i. I also need a prior for \theta_i. I assume between-study normality:

\theta_i \sim N(\mu, \tau^2) \qquad (10)

where \mu and \tau are unknown hyperparameters.
Conditioning on the distribution of the data, given by Equation 9, I get a posterior:

\theta_i | \mu, \tau, Y \sim N(\hat{\theta}_i, V_i) \qquad (11)

where

\hat{\theta}_i = \frac{Y_i / \sigma_i^2 + \mu / \tau^2}{1 / \sigma_i^2 + 1 / \tau^2}, \qquad V_i = \frac{1}{1 / \sigma_i^2 + 1 / \tau^2} \qquad (12)
I then need to pin down \mu | \tau and \tau by constructing their posterior distributions given non-informative priors and updating based on the data. I assume a uniform prior for \mu | \tau, and as the Y_i are estimates of \mu with variance (\sigma_i^2 + \tau^2), obtain:

\mu | \tau, Y \sim N(\hat{\mu}, V_\mu) \qquad (13)

where

\hat{\mu} = \frac{\sum_i Y_i / (\sigma_i^2 + \tau^2)}{\sum_i 1 / (\sigma_i^2 + \tau^2)}, \qquad V_\mu = \frac{1}{\sum_i 1 / (\sigma_i^2 + \tau^2)} \qquad (14)
For \tau, note that p(\tau | Y) = p(\mu, \tau | Y) / p(\mu | \tau, Y). The denominator follows from Equation 13; for the numerator, we can observe that p(\mu, \tau | Y) is proportional to p(\mu, \tau) p(Y | \mu, \tau), and we know the marginal distribution of Y_i | \mu, \tau:

Y_i | \mu, \tau \sim N(\mu, \sigma_i^2 + \tau^2) \qquad (15)

I use a uniform prior for \tau, following Gelman et al. (2005). This yields the posterior for the numerator:

p(\mu, \tau | Y) \propto p(\mu, \tau) \prod_i N(Y_i | \mu, \sigma_i^2 + \tau^2) \qquad (16)
Putting together all the pieces in reverse order, I first construct p(\tau | Y) and simulate \tau, then draw \mu given \tau, and finally the \theta_i.
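The following is a minimal sketch of that simulation (my own illustration of the algorithm just described, not the paper's code). It evaluates p(\tau | Y) on a grid using p(\tau | Y) = p(\mu, \tau | Y) / p(\mu | \tau, Y) evaluated at \mu = \hat{\mu}, which under the uniform priors gives p(\tau | Y) \propto V_\mu^{1/2} \prod_i N(Y_i | \hat{\mu}, \sigma_i^2 + \tau^2), then draws \tau, \mu and \theta_i in turn. The grid bounds are arbitrary choices.

```python
import numpy as np

def sample_random_effects(y, se, n_draws=5000, rng=None):
    """Grid-based posterior simulation for the normal hierarchical model,
    in the order described in the text: tau, then mu | tau, then theta_i.
    Uniform priors on mu | tau and on tau."""
    rng = np.random.default_rng() if rng is None else rng
    y, s2 = np.asarray(y, float), np.asarray(se, float) ** 2
    tau_grid = np.linspace(1e-6, 2.0 * y.std() + 1e-6, 400)

    # log p(tau | Y) up to a constant:
    # p(tau | Y) prop. to V_mu^(1/2) * prod_i N(Y_i | mu_hat, s_i^2 + tau^2)
    log_p = np.empty_like(tau_grid)
    for k, tau in enumerate(tau_grid):
        w = 1.0 / (s2 + tau**2)
        mu_hat, V_mu = np.sum(w * y) / np.sum(w), 1.0 / np.sum(w)
        log_p[k] = 0.5 * np.log(V_mu) - 0.5 * np.sum(
            np.log(s2 + tau**2) + w * (y - mu_hat) ** 2)
    p = np.exp(log_p - log_p.max())
    p /= p.sum()

    draws = []
    for tau in rng.choice(tau_grid, size=n_draws, p=p):    # tau | Y
        w = 1.0 / (s2 + tau**2)
        mu = rng.normal(np.sum(w * y) / np.sum(w),         # mu | tau, Y
                        np.sqrt(1.0 / np.sum(w)))
        prec = 1.0 / s2 + 1.0 / tau**2                     # Equation (12)
        theta_hat = (y / s2 + mu / tau**2) / prec
        draws.append(rng.normal(theta_hat, np.sqrt(1.0 / prec)))  # theta_i
    return np.array(draws)   # shape: n_draws x (number of studies)
```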
2.5 Estimating a Mixed Model
The strategy here is similar. Appendix D contains a derivation.
3 Data
This paper uses a database of impact evaluation results collected by AidGrade, a U.S.
non-profit research institute that I founded in 2012. AidGrade focuses on gathering the
results of impact evaluations and analyzing the data, including through meta-analysis. Its
data on impact evaluation results were collected in the course of its meta-analyses from
2012-2014 (AidGrade, 2015).
AidGrade’s meta-analyses follow the standard stages: (1) topic selection; (2) a search
for relevant papers; (3) screening of papers; (4) data extraction; and (5) data analysis. In
addition, it pays attention to (6) dissemination and (7) updating of results. Here, I will
discuss the selection of papers (stages 1-3) and the data extraction protocol (stage 4); more
detail is provided in Appendix B.
3.1 Selection of Papers
The interventions selected for meta-analysis were chosen largely on the basis of there being a sufficient number of studies on the topic. Five AidGrade staff members each independently made a preliminary list of interventions for examination; the lists were then combined and searches done for each topic to determine if there were likely to be enough impact evaluations for a meta-analysis. The remaining list was voted on by the general public online and partially randomized. Appendix B provides further detail.
A comprehensive literature search was done using a mix of the search aggregators SciVerse, Google Scholar, and EBSCO/PubMed. The online databases of J-PAL, IPA, CEGA and 3ie were also searched for completeness. Finally, the references of any existing systematic reviews or meta-analyses were collected.
Any impact evaluation which appeared to be on the intervention in question was included, barring those in developed countries.7 Any paper that tried to consider the counterfactual was considered an impact evaluation. Both published papers and working papers were included. The search and screening criteria were deliberately broad. There is not enough room to include the full text of the search terms and inclusion criteria for all 20 topics in this paper, but these are available in an online appendix as detailed in Appendix A.
3.2 Data Extraction
The subset of the data on which I focus is based on those papers that passed all screening stages in the meta-analyses. Again, the search and screening criteria were very broad; after passing the full text screening, the vast majority of papers that were later excluded were excluded merely because they had no outcome variables in common or did not provide adequate data (for example, not providing data that could be used to calculate the standard error of an estimate, or for a variety of other idiosyncratic reasons, such as displaying results only graphically). The small overlap of outcome variables is a surprising and notable feature of the data. Ultimately, the data I draw upon for this paper consist of 15,021 results (double-coded and then reconciled by a third researcher) across 647 papers covering the 20 types of development program listed in Table 3.8 For the sake of comparison, though the two organizations clearly do different things, at the time of writing this is more impact evaluations than J-PAL has published, concentrated in these 20 topics. Unfortunately, only 318 of these papers both overlapped in outcomes with another paper and could be standardized; these are the papers included in the main results, which rely on intervention-outcome groups. Outcomes were defined under several rules of varying specificity, as will be discussed shortly.

7High-income countries, according to the World Bank's classification system.
8Three titles here may be misleading. "Mobile phone-based reminders" refers specifically to SMS or voice reminders for health-related outcomes. "Women's empowerment programs" required an educational component to be included in the intervention, and it could not be an unrelated intervention that merely disaggregated outcomes by gender. Finally, micronutrients were initially too loosely defined; this was narrowed down to focus on those providing zinc to children, but the other micronutrient papers are still included in the data, with a tag, as they may still be useful.
Table 3: List of Development Programs Covered

2012: Conditional cash transfers; Deworming; Improved stoves; Insecticide-treated bed nets; Microfinance; Safe water storage; Scholarships; School meals; Unconditional cash transfers; Water treatment.

2013: Contract teachers; Financial literacy training; HIV education; Irrigation; Micro health insurance; Micronutrient supplementation; Mobile phone-based reminders; Performance pay; Rural electrification; Women's empowerment programs.
Seventy-three variables were coded for each paper. Additional topic-specific variables were coded for some sets of papers, such as the median and mean loan size for microfinance programs. This paper focuses on the variables held in common across the different topics. These include: which method was used; if randomized, whether it was randomized by cluster; whether it was blinded; where it was conducted (village, province, country - these were later geocoded in a separate process); what kind of institution carried out the implementation; characteristics of the population; and the duration of the intervention from the baseline to the midline or endline results, among others. A full set of variables and the coding manual is available online, as detailed in Appendix A.
As this paper pays particular attention to the program implementer, it is worth discussing how this variable was coded in more detail. There were several types of implementers that could be coded: governments, NGOs, private sector firms, and academics. There was also a code for "other" (primarily collaborations) or "unclear". The vast majority of studies were implemented by academic research teams and NGOs. This paper considers NGOs and academic research teams together because it turned out to be practically difficult to distinguish between them in the studies, especially as the passive voice was frequently used (e.g. "X was done" without noting who did it). There were only a few private sector firms involved, so they are considered with the "other" category in this paper.
Studies tend to report results for multiple specifications. AidGrade focused on those results least likely to have been influenced by author choices: those with the fewest controls, apart from fixed effects. Where a study reported results using different methodologies, coders were instructed to collect the findings obtained under the authors' preferred methodology; where the preferred methodology was unclear, coders were advised to follow the internal preference ordering of prioritizing randomized controlled trials, followed by regression discontinuity designs and differences-in-differences, followed by matching, and to collect multiple sets of results when they were unclear on which to include. Where results were presented separately for multiple subgroups, coders were similarly advised to err on the side of caution and to collect both the aggregate results and results by subgroup, except where the author appeared to be including a subgroup only because results were significant within that subgroup. For example, if an author reported results for children aged 8-15 and then also presented results for children aged 12-13, only the aggregate results would be recorded, but if the author presented results for children aged 8-9, 10-11, 12-13, and 14-15, all subgroups would be coded as well as the aggregate result when presented. Authors only rarely reported isolated subgroups, so this was not a major issue in practice.
When considering the variation of effect sizes within a group of papers, the definition of
the group is clearly critical. Two different rules were initially used to define outcomes: a
strict rule, under which only identical outcome variables are considered alike, and a loose
rule, under which similar but distinct outcomes are grouped into clusters.
The precise coding rules were as follows:
1. We consider outcome A to be the same as outcome B under the "strict rule" if outcomes A and B measure the exact same quality. Different units may be used, pending conversion. The outcomes may cover different timespans (e.g. encompassing both outcomes over "the last month" and "the last week"). They may also cover different populations (e.g. children or adults). Examples: height; attendance rates.

2. We consider outcome A to be the same as outcome B under the "loose rule" if they do not meet the strict rule but are clearly related. Example: parasitemia greater than 4000/µl with fever, and parasitemia greater than 2500/µl.
Clearly, even under the strict rule, differences between the studies may exist; however, using two different rules allows us to isolate the potential sources of variation, and other variables were coded to capture some of this variation, such as the age of those in the sample. If one were to divide the studies by these characteristics, however, the data would usually be too sparse for analysis.
Interventions were also defined separately, and coders were asked to write a short description of the details of each program. Program names were recorded so as to identify those papers on the same program, such as the various evaluations of PROGRESA.
After coding, the data were then standardized to make results easier to interpret and so as not to overly weight those outcomes with larger scales. The typical way to compare results across different outcomes is by using the standardized mean difference, defined as:

SMD = \frac{\mu_1 - \mu_2}{\sigma_p}

where \mu_1 is the mean outcome in the treatment group, \mu_2 is the mean outcome in the control group, and \sigma_p is the pooled standard deviation. When data are not available to calculate the pooled standard deviation, it can be approximated by the standard deviation of the dependent variable for the entire distribution of observations or as the standard deviation in the control group (Glass, 1976). If that is not available either, due to standard deviations not having been reported in the original papers, one can use the typical standard deviation for the intervention-outcome. I follow this approach to calculate the standardized mean difference, which is then used as the effect size measure for the rest of the paper unless otherwise noted.
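A small sketch of this standardization logic, with a simplified version of the fallback chain just described (the function and argument names are my own):

```python
import numpy as np

def standardized_mean_difference(mean_t, mean_c, sd_t=None, sd_c=None,
                                 n_t=None, n_c=None, typical_sd=None):
    """SMD = (mu_1 - mu_2) / sigma_p, falling back when the pooled SD
    cannot be computed: pooled SD, then control-group SD (Glass, 1976),
    then the typical SD for the intervention-outcome."""
    if None not in (sd_t, sd_c, n_t, n_c):
        sigma_p = np.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                          / (n_t + n_c - 2))        # pooled SD
    elif sd_c is not None:
        sigma_p = sd_c                              # control-group SD
    else:
        sigma_p = typical_sd                        # last-resort fallback
    return (mean_t - mean_c) / sigma_p

# Hypothetical example: enrollment of 0.82 vs 0.74, SDs ~0.4, n = 500 each.
print(standardized_mean_difference(0.82, 0.74, 0.39, 0.41, 500, 500))
```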
This paper uses the “strict” outcomes where available, but the “loose” outcomes where
that would keep more data. For papers which were follow-ups of the same study, the most
recent results were used for each outcome.
Finally, one paper appeared to misreport results, suggesting implausibly low values and standard deviations for hemoglobin. These results were excluded and the paper's corresponding author contacted. Excluding this paper's results, effect sizes range between -1.5 and 1.8 SD, with an interquartile range of 0 to 0.2 SD. So as to mitigate sensitivity to individual results, especially with the small number of papers in some intervention-outcome groups, I restrict attention to those standardized effect sizes less than 2 SD away from 0, dropping 1 additional observation. I report main results including this observation in the Appendix.
3.3 Data Description
Figure 2 summarizes the distribution of studies covering the interventions and outcomes
considered in this paper that can be standardized. Attention will typically be limited to
those intervention-outcome combinations on which we have data for at least three papers.
Table 13 in Appendix C lists the interventions and outcomes and describes their results in more detail, providing the distribution of significant and insignificant results. It should be emphasized that the number of negative and significant, insignificant, and positive and significant results per intervention-outcome combination provides only ambiguous evidence of the typical efficacy of a particular type of intervention. Simply tallying the numbers in each category is known as "vote counting" and can yield misleading results if, for example, some studies are underpowered.
Table 4 further summarizes the distribution of papers across interventions and highlights
the fact that papers exhibit very little overlap in terms of outcomes studied. This is consistent
with the story of researchers each wanting to publish one of the first papers on a topic. Vivalt
(2015a) finds that later papers on the same intervention-outcome combination more often
remain as working papers.
A note must be made about combining data. When conducting a meta-analysis, the Cochrane Handbook for Systematic Reviews of Interventions recommends collapsing the data to one observation per intervention-outcome-paper, and I do this when generating the within intervention-outcome meta-analyses (Higgins and Green, 2011). Where results had been reported for multiple subgroups (e.g. women and men), I aggregated them as in the Cochrane Handbook's Table 7.7.a. Where results were reported for multiple time periods (e.g. 6 months after the intervention and 12 months after the intervention), I used the most comparable time periods across papers. When combining across multiple outcomes, which has limited use but will come up later in the paper, I used the formulae from Borenstein et al. (2009), Chapter 24.
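For reference, a sketch of that subgroup aggregation, following the combined-groups formulae in the Cochrane Handbook's Table 7.7.a (variable names are mine):

```python
import numpy as np

def combine_subgroups(n1, m1, sd1, n2, m2, sd2):
    """Collapse two subgroup summaries (sample size, mean, SD) into one,
    per the combined-groups formulae of the Cochrane Handbook's
    Table 7.7.a."""
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    sd = np.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2
                  + (n1 * n2 / n) * (m1 - m2)**2) / (n - 1))
    return n, m, sd

# E.g., combining hypothetical results for women and men:
print(combine_subgroups(120, 0.15, 0.9, 140, 0.05, 1.1))
```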
Figure 2: Within-Intervention-Outcome Number of Papers
Table 4: Descriptive Statistics: Distribution of Narrow Outcomes
Intervention, number of outcomes, mean papers per outcome, max papers per outcome
Conditional cash transfers 10 21 37
Contract teachers 1 3 3
Deworming 12 13 18
Financial literacy 1 5 5
HIV/AIDS Education 3 8 10
Improved stoves 4 2 2
Insecticide-treated bed nets 1 9 9
Irrigation 2 2 2
Micro health insurance 1 2 2
Microfinance 5 4 5
Micronutrient supplementation 23 27 47
Mobile phone-based reminders 2 4 5
Performance pay 1 3 3
Rural electrification 3 3 3
Safe water storage 1 2 2
Scholarships 3 4 5
School meals 3 3 3
Unconditional cash transfers 3 9 11
Water treatment 2 5 6
Women’s empowerment programs 2 2 2
Average 4.2 6.5 9.0
4 Generalizability of Impact Evaluation Results
4.1 Without Modelling Heterogeneity
Table 5 presents results for the metrics described earlier, within intervention-outcome combinations. All Y_i were converted to be in terms of standard deviations to put them on a common scale before statistics were calculated, with the aforementioned caveats. The different measures yield quite different results, as they measure different things, as previously discussed. The coefficient of variation depends heavily on the mean; the I², on the precision of the underlying estimates.
Table 5: Heterogeneity Measures for Effect Sizes Within Intervention-Outcomes
Intervention Outcome var(Y_i) CV(Y_i) I²
Microfinance Assets 0.000 5.508 1.000
Rural Electrification Enrollment rate 0.001 0.129 0.768
Micronutrients Cough prevalence 0.001 1.648 0.995
Microfinance Total income 0.001 0.989 0.999
Microfinance Savings 0.002 1.773 1.000
Financial Literacy Savings 0.004 5.472 0.891
Microfinance Profits 0.005 5.448 1.000
Contract Teachers Test scores 0.005 0.403 1.000
Performance Pay Test scores 0.006 0.608 1.000
Micronutrients Body mass index 0.007 0.675 1.000
Conditional Cash Transfers Unpaid labor 0.009 0.920 0.797
Micronutrients Weight-for-age 0.009 1.941 0.884
Micronutrients Weight-for-height 0.010 2.148 0.677
Micronutrients Birthweight 0.010 0.981 0.827
Micronutrients Height-for-age 0.012 2.467 0.942
Conditional Cash Transfers Test scores 0.013 1.866 0.995
Deworming Hemoglobin 0.015 3.377 0.919
Micronutrients Mid-upper arm circumference 0.015 2.078 0.502
Conditional Cash Transfers Enrollment rate 0.015 0.831 1.000
Unconditional Cash Transfers Enrollment rate 0.016 1.093 0.998
Water Treatment Diarrhea prevalence 0.020 0.966 1.000
SMS Reminders Treatment adherence 0.022 1.672 0.780
Conditional Cash Transfers Labor force participation 0.023 1.628 0.424
School Meals Test scores 0.023 1.288 0.559
Micronutrients Height 0.023 4.369 0.826
Micronutrients Mortality rate 0.025 2.880 0.201
Micronutrients Stunted 0.025 1.110 0.262
Bed Nets Malaria 0.029 0.497 0.880
Conditional Cash Transfers Attendance rate 0.030 0.523 0.939
Micronutrients Weight 0.034 2.696 0.549
HIV/AIDS Education Used contraceptives 0.036 3.117 0.490
Micronutrients Perinatal deaths 0.038 2.096 0.176
Deworming Height 0.049 2.361 1.000
Micronutrients Test scores 0.052 1.694 0.966
Scholarships Enrollment rate 0.053 0.687 1.000
Conditional Cash Transfers Height-for-age 0.055 22.166 0.165
Deworming Weight-for-height 0.072 3.129 0.986
Micronutrients Stillbirths 0.075 3.041 0.108
School Meals Enrollment rate 0.081 1.142 0.080
Micronutrients Prevalence of anemia 0.095 0.793 0.692
Deworming Height-for-age 0.098 1.978 1.000
Deworming Weight-for-age 0.107 2.287 0.998
Micronutrients Diarrhea incidence 0.109 3.300 0.985
Micronutrients Diarrhea prevalence 0.111 1.205 0.837
Micronutrients Fever prevalence 0.146 3.076 0.667
Deworming Weight 0.184 4.758 1.000
Micronutrients Hemoglobin 0.215 1.439 0.984
SMS Reminders Appointment attendance rate 0.224 2.908 0.869
Deworming Mid-upper arm circumference 0.439 1.773 0.994
Conditional Cash Transfers Probability unpaid work 0.609 6.415 0.834
Rural Electrification Study time 0.997 1.102 0.142
How should we interpret these numbers? Higgins and Thompson, who defined I², called 0.25 indicative of "low", 0.5 "medium", and 0.75 "high" levels of heterogeneity (2002; Higgins et al., 2003). Figure 3 plots a histogram of the results, with lines corresponding to these values demarcated. Clearly, there is a lot of systematic variation in the results according to the I² measure. No similar defined benchmarks exist for the variance or coefficient of variation, although studies in the medical literature tend to exhibit a coefficient of variation of approximately 0.05-0.5 (Tian, 2005; Ng, 2014). By this standard, too, results would appear quite heterogeneous.
Figure 3: Density of I² values
We can also compare values across the different intervention-outcome combinations within the data set. Here, the intervention-outcome combinations that fall within the bottom third by variance have var(Y_i) ≤ 0.015; the top third have var(Y_i) ≥ 0.052. Similarly, the threshold delineating the bottom third for the coefficient of variation is 1.14 and, for the top third, 2.36; for I², the thresholds are 0.78 and 0.99, respectively. If we expect these intervention-outcomes to be broadly comparable to others we might want to consider in the future, we could use these values to benchmark future results.
Defining dispersion to be "low" or "high" in this manner may be unsatisfying because the classifications that result are relative. Relative classifications might have some value, but sometimes are not so important; for example, it is hard to think that there is a meaningful difference between an I² of just below 0.99 and an I² of just above 0.99. An alternative benchmark that might have more appeal is the average within-study variance or coefficient of variation. If the across-study variation approached the within-study variation, we might not be so concerned about generalizability.
Table 6 illustrates the gap between the across-study and mean within-study variance, coefficient of variation, and I², for those intervention-outcomes for which we have enough data to calculate the within-study measures. Not all studies report multiple results for the intervention-outcome combination in question. A paper might report multiple results for a particular intervention-outcome combination if, for example, it were reporting results for different subgroups, such as for different age groups, genders, or geographic areas. The median within-paper variance for those papers for which this can be generated is 0.027, while it is 0.037 across papers; similarly, the median within-paper coefficient of variation is 0.91, compared to 1.48 across papers. If we were to form the I² for each paper separately, the median within-paper value would be 0.63, as opposed to 0.94 across papers. Figure 4 presents the distributions graphically; to increase the sample size, this figure includes results even when there are only two papers within an intervention-outcome combination or two results reported within a paper.
Table 6: Across-Paper vs. Mean Within-Paper Heterogeneity
Intervention | Outcome | Across-paper var(Y_i) | Within-paper var(Y_i) | Across-paper CV(Y_i) | Within-paper CV(Y_i) | Across-paper I² | Within-paper I²
Micronutrients Cough prevalence 0.001 0.006 1.017 3.181 0.755 1.000
Conditional Cash Transfers Enrollment rate 0.009 0.027 0.790 0.968 0.998 0.682
Conditional Cash Transfers Unpaid labor 0.009 0.004 0.918 0.853 0.781 0.778
Deworming Hemoglobin 0.009 0.068 1.639 8.687 0.583 0.712
Micronutrients Weight-for-height 0.010 0.005 2.252 * 0.665 0.633
Micronutrients Birthweight 0.010 0.011 0.974 0.963 0.784 0.882
Micronutrients Weight-for-age 0.010 0.124 2.370 0.713 1.000 0.652
School Meals Height-for-age 0.011 0.000 1.086 * 0.942 0.703
Micronutrients Height-for-age 0.012 0.042 2.474 3.751 0.993 0.508
Unconditional Cash Transfers Enrollment rate 0.014 0.014 1.223 * 0.982 0.497
SMS Reminders Treatment adherence 0.022 0.008 1.479 0.672 0.958 0.573
Micronutrients Height 0.023 0.028 4.001 3.471 0.896 0.548
Micronutrients Stunted 0.024 0.059 1.085 24.373 0.348 0.149
Micronutrients Mortality rate 0.026 0.195 2.533 1.561 0.164 0.077
Micronutrients Weight 0.029 0.027 2.852 0.149 0.629 0.228
Micronutrients Fever prevalence 0.034 0.011 5.937 0.126 0.602 0.066
Microfinance Total income 0.037 0.003 1.770 1.232 0.970 1.000
Conditional Cash Transfers Probability unpaid work 0.046 0.386 1.419 0.408 0.989 0.517
Conditional Cash Transfers Attendance rate 0.046 0.018 0.591 0.526 0.988 0.313
Deworming Height 0.048 0.112 1.845 0.211 1.000 0.665
Micronutrients Perinatal deaths 0.049 0.015 2.087 0.234 0.451 0.089
Bed Nets Malaria 0.052 0.047 0.650 4.093 0.967 0.551
Scholarships Enrollment rate 0.053 0.026 1.094 1.561 1.000 0.612
Conditional Cash Transfers Height-for-age 0.055 0.002 22.166 1.212 0.162 0.600
HIV/AIDS Education Used contraceptives 0.059 0.120 2.863 6.967 0.424 0.492
Deworming Weight-for-height 0.072 0.164 3.127 * 1.000 0.907
Deworming Height-for-age 0.100 0.005 2.043 1.842 1.000 0.741
Deworming Weight-for-age 0.108 0.004 2.317 1.040 1.000 0.704
Micronutrients Diarrhea incidence 0.135 0.016 2.844 1.741 0.922 0.807
Micronutrients Diarrhea prevalence 0.137 0.029 1.375 3.385 0.811 0.664
Deworming Weight 0.168 0.121 4.087 1.900 0.995 0.813
Conditional Cash Transfers Labor force participation 0.790 0.047 2.931 4.300 0.378 0.559
Micronutrients Hemoglobin 2.650 0.176 2.982 0.731 1.000 0.996
Within-paper values are based on those papers which report results for different subsets of the data. For closer comparison of the across- and within-paper statistics, the across-paper values are based on the same data set, aggregating the within-paper results to one observation per intervention-outcome-paper, as discussed. Each paper needs to have reported 3 results for an intervention-outcome combination for it to be included in the calculation, in addition to the requirement of there being 3 papers on the intervention-outcome combination. Due to the slightly different sample, the across-paper statistics diverge slightly from those reported in Table 5. Occasionally, within-paper measures of the mean equal or approach zero, making the coefficient of variation undefined or unreasonable; "*" denotes those coefficients of variation that were either undefined or greater than 10,000,000.
Figure 4: Distribution of within and across-paper heterogeneity measures
We can also gauge the magnitudes of these measures by comparison with effect sizes.
We know effect sizes are typically considered “small” if they are less than 0.2 SDs and that
the largest coefficient of variation typically considered in the medical literature is 0.5 (Tian,
2005; Ng, 2014). Taking 0.5 as a very conservative upper bound for a “small” coefficient of
variation, this would imply a variance of less than 0.01 for an effect size of 0.2. That the
actual mean effect size in the data is closer to 0.1 makes this even more of an upper bound;
applying the same reasoning to an effect size of 0.1 would result in the threshold being set
at a variance of 0.0025.
Finally, we can try to set bounds more directly, based on the expected prediction error.
Here it is immediately apparent that what counts as large or small error depends on the
policy question. In some cases, it might not matter if an effect size were mis-predicted by
25%; in others, a prediction error of this magnitude could mean the difference between
choosing one program over another or determine whether a program is worthwhile to pursue
at all.
Still, if we take the mean effect size within an intervention-outcome to be our "best guess" of how a program will perform and, as an illustrative example, want the prediction error to be less than 25% at least 50% of the time, this would imply a certain cut-off threshold for the variance if we assume that results are normally distributed. Note that the assumption that results are drawn from the same normal distribution, whose mean and variance can be approximated by the mean and variance of observed results, is a simplification for the purpose of a back-of-the-envelope calculation. We would expect results to be drawn from different distributions.
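A back-of-the-envelope sketch of how such a variance threshold can be computed (my illustrative reconstruction; the paper's exact var25 and var50 construction may differ): under normality, a draw lies within a fraction f of the mean with probability p whenever the standard deviation is at most f·|mean| divided by the normal quantile z_{(1+p)/2}.

```python
from scipy.stats import norm

def variance_bound(mean_effect, rel_error=0.25, coverage=0.5):
    """Largest variance such that a draw from N(mean, var) lies within
    rel_error * |mean| of the mean with probability >= coverage.
    An illustrative reconstruction, not the paper's exact formula."""
    z = norm.ppf(0.5 + coverage / 2.0)   # ~0.674 for coverage = 0.5
    sd_max = rel_error * abs(mean_effect) / z
    return sd_max**2

# E.g., for a hypothetical mean effect size of 0.15 SD:
print(variance_bound(0.15, 0.25))  # threshold for a 25% prediction error
print(variance_bound(0.15, 0.50))  # threshold for a 50% prediction error
```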
Table 7 summarizes the implied bounds on the variance for the prediction error to be less than 25% and 50%, respectively, alongside the actual variance in results within each intervention-outcome. In only 1 of 51 cases is the true variance in results smaller than the variance implied by the 25% prediction error cut-off threshold, and in 9 other cases it is below the 50% prediction error threshold. In other words, the variance of results within each intervention-outcome would imply a prediction error of more than 50%, more than 80% of the time.
Table 7: Actual Variance vs. Variance for Prediction Error Thresholds
Intervention Outcome Ȳ_i var(Y_i) var25 var50
Microfinance Assets 0.003 0.000 0.000 0.000
Rural Electrification Enrollment rate 0.176 0.001 0.005 0.027
Micronutrients Cough prevalence -0.016 0.001 0.000 0.000
Microfinance Total income 0.029 0.001 0.000 0.001
Microfinance Savings 0.027 0.002 0.000 0.001
Financial Literacy Savings -0.012 0.004 0.000 0.000
Microfinance Profits -0.013 0.005 0.000 0.000
Contract Teachers Test scores 0.182 0.005 0.005 0.029
Performance Pay Test scores 0.131 0.006 0.003 0.015
Micronutrients Body mass index 0.125 0.007 0.002 0.014
Conditional Cash Transfers Unpaid labor 0.103 0.009 0.002 0.009
Micronutrients Weight-for-age 0.050 0.009 0.000 0.002
Micronutrients Weight-for-height 0.045 0.010 0.000 0.002
Micronutrients Birthweight 0.102 0.010 0.002 0.009
Micronutrients Height-for-age 0.044 0.012 0.000 0.002
Conditional Cash Transfers Test scores 0.062 0.013 0.001 0.003
Deworming Hemoglobin 0.036 0.015 0.000 0.001
Micronutrients Mid-upper arm circumference 0.058 0.015 0.001 0.003
Conditional Cash Transfers Enrollment rate 0.150 0.015 0.003 0.019
Unconditional Cash Transfers Enrollment rate 0.115 0.016 0.002 0.011
Water Treatment Diarrhea prevalence 0.145 0.020 0.003 0.018
SMS Reminders Treatment adherence 0.088 0.022 0.001 0.007
Conditional Cash Transfers Labor force participation 0.092 0.023 0.001 0.007
School Meals Test scores 0.117 0.023 0.002 0.012
Micronutrients Height 0.035 0.023 0.000 0.001
Micronutrients Mortality rate -0.054 0.025 0.000 0.003
Micronutrients Stunted 0.143 0.025 0.003 0.018
Bed Nets Malaria 0.342 0.029 0.018 0.101
Conditional Cash Transfers Attendance rate 0.333 0.030 0.017 0.096
Micronutrients Weight 0.068 0.034 0.001 0.004
HIV/AIDS Education Used contraceptives 0.061 0.036 0.001 0.003
Micronutrients Perinatal deaths -0.093 0.038 0.001 0.008
Deworming Height 0.094 0.049 0.001 0.008
Micronutrients Test scores 0.134 0.052 0.003 0.016
Scholarships Enrollment rate 0.336 0.053 0.017 0.098
Conditional Cash Transfers Height-for-age -0.011 0.055 0.000 0.000
Deworming Weight-for-height 0.086 0.072 0.001 0.006
Micronutrients Stillbirths -0.090 0.075 0.001 0.007
School Meals Enrollment rate 0.250 0.081 0.009 0.054
Micronutrients Prevalence of anemia 0.389 0.095 0.023 0.131
Deworming Height-for-age 0.159 0.098 0.004 0.022
Deworming Weight-for-age 0.143 0.107 0.003 0.018
Micronutrients Diarrhea incidence 0.100 0.109 0.002 0.009
Micronutrients Diarrhea prevalence 0.277 0.111 0.012 0.066
26
Micronutrients Fever prevalence 0.124 0.146 0.002 0.013
Deworming Weight 0.090 0.184 0.001 0.007
Micronutrients Hemoglobin 0.322 0.215 0.016 0.090
SMS Reminders Appointment attendance rate 0.163 0.224 0.004 0.023
Deworming Mid-upper arm circumference 0.373 0.439 0.021 0.121
Conditional Cash Transfers Probability unpaid work -0.122 0.609 0.002 0.013
Rural Electrification Study time 0.906 0.997 0.125 0.710
var25 represents the variance that would result in a 25% prediction error for draws from a normal distribution centered at Ȳ_i. var50 represents the variance that would result in a 50% prediction error.
4.2 With Modelling Heterogeneity
4.2.1 Across Intervention-Outcomes
The results so far have not considered how much of the heterogeneity can be explained. If the heterogeneity can be systematically modelled, this would improve our ability to make predictions. Do results exhibit any variation that is systematic? To begin, I first present some OLS results, looking across different intervention-outcome combinations, to examine whether effect sizes are associated with any characteristics of the program, study, or sample, pooling data from different intervention-outcomes.
As Table 8 indicates, there is some evidence that studies with a smaller number of observations have greater effect sizes than studies based on a larger number of observations. This is what we would expect if specification searching were easier in small datasets; this pattern of results would also be what we would expect if power calculations drove researchers to only proceed with studies with small sample sizes if they believed the program would result in a large effect size, or if larger studies are less well-targeted. Interestingly, government-implemented programs fare worse even controlling for sample size (the dummy variable category left out is "Other-implemented", which mainly consists of collaborations and private sector-implemented interventions). Studies in the Middle East / North Africa region may appear to do slightly better than those in Sub-Saharan Africa (the excluded region category), but not much weight should be put on this, as very few studies were conducted in the former region.
While these regressions have the advantage of allowing me to draw on a larger sample
of studies and we might think that any patterns observed across so many interventions and
outcomes are fairly robust, we might be able to explain more variation if we restrict attention
to a particular intervention-outcome combination. I therefore focus on the case of conditional
cash transfers (CCTs) and enrollment rates, as this is the intervention-outcome combination
that contains the largest number of papers.
Table 8: Regression of Effect Size on Study Characteristics
(1) (2) (3) (4) (5)
Effect size Effect size Effect size Effect size Effect size
b/se b/se b/se b/se b/se
Number of -0.011** -0.012*** -0.009*
observations (100,000s) (0.00) (0.00) (0.00)
Government-implemented -0.107*** -0.087**
(0.04) (0.04)
Academic/NGO-implemented -0.055 -0.057
(0.04) (0.05)
RCT 0.038
(0.03)
East Asia -0.003
(0.03)
Latin America 0.012
(0.04)
Middle East/North 0.275**
Africa (0.11)
South Asia 0.021
(0.04)
Constant 0.120*** 0.180*** 0.091*** 0.105*** 0.177***
(0.00) (0.03) (0.02) (0.02) (0.03)
Observations 556 656 656 556 556
R² 0.20 0.23 0.22 0.23 0.20
4.2.2 Within an Intervention-Outcome Combination: The Case of CCTs and Enrollment Rates
The previous results used the across-intervention-outcome data, which were aggregated to one result per intervention-outcome-paper. However, we might think that more variation could be explained by carefully modelling results within a particular intervention-outcome combination. This section provides an example, using the case of conditional cash transfers and enrollment rates, the intervention-outcome combination covered by the most papers.

Suppose we were to try to explain as much variability in outcomes as possible, using sample characteristics. The available variables which might plausibly have a relationship to effect size are: the baseline enrollment rates9; the sample size; whether the study was done in a rural or urban setting, or both; results for other programs in the same region10; and the age and gender of the sample under consideration.
Table 9 shows the results of OLS regressions of the effect size on these variables, in turn. The baseline enrollment rates show the strongest relationship to effect size, as reflected in the R² and significance levels: it is easier to have large gains where initial rates are low. Some papers pay particular attention to children who were not enrolled at baseline or who were enrolled at baseline; these are coded as a “0%” or “100%” enrollment rate at baseline, respectively, and are also represented by two dummy variables. Larger studies and studies done in urban areas tend to find smaller effect sizes than smaller studies or studies done in rural or mixed urban/rural areas. Finally, for each result I calculate the mean result in the same region, excluding results from the program in question. Results do appear slightly correlated across different programs in the same region.
⁹ In some cases, only endline enrollment rates are reported. This variable is therefore constructed by using baseline rates for both the treatment and control group where they are available, followed by, in turn: the baseline rate for the control group; the baseline rate for the treatment group; the endline rate for the control group; the endline rate for the treatment and control group; and the endline rate for the treatment group.

¹⁰ Regions include: Latin America, Africa, the Middle East and North Africa, East Asia, and South Asia, following the World Bank’s geographical divisions.
Table 9: Regression of Projects’ Effect Sizes on Characteristics (CCTs on Enrollment Rates)

Dependent variable: effect size (ES). Standard errors in parentheses.

                              (1)       (2)      (3)     (4)      (5)       (6)     (7)    (8)    (9)      (10)
Enrollment Rates           -0.224*** -0.092                                                              -0.127***
                            (0.05)   (0.06)                                                               (0.02)
Enrolled at Baseline                 -0.002
                                     (0.02)
Not Enrolled at Baseline              0.183***                                                            0.142***
                                     (0.05)                                                               (0.03)
Number of Observations                        -0.011*                                                    -0.002
  (100,000s)                                   (0.01)                                                     (0.00)
Rural                                                   0.049**                                           0.002
                                                        (0.02)                                            (0.02)
Urban                                                           -0.068***                                -0.039**
                                                                 (0.02)                                   (0.02)
Girls                                                                     -0.002
                                                                          (0.03)
Boys                                                                              -0.019
                                                                                  (0.02)
Minimum Sample Age                                                                       0.005
                                                                                        (0.01)
Mean Regional Result                                                                            1.000**   0.714**
                                                                                                (0.38)    (0.28)
Observations                 112      112      108      130      130      130     130    104    130       92
R²                           0.41     0.52     0.01     0.06     0.05     0.00    0.01   0.02   0.01      0.58
Table 10: Impact of Mixed Models on Measures

                        var(Y_i)   var_R(Y_i − Ŷ_i)   CV(Y_i)   CV_R(Y_i − Ŷ_i)    I²     I²_R     N
Random effects model     0.011          0.011           1.24         1.24          0.97   0.97    122
Mixed model (1)          0.011          0.007           1.28         1.04          0.97   0.96    104
Mixed model (2)          0.012          0.005           1.25         0.85          0.96   0.93     87
As baseline enrollment rates have the strongest relationship to effect size, I use them as an explanatory variable in a hierarchical mixed model, to explore how this affects the residual var_R(Y_i − Ŷ_i), CV_R(Y_i − Ŷ_i) and I²_R. I also use the specification in column (10) of Table 9 as a robustness check. The results are reported in Table 10 for each of these two mixed models, alongside the values from the random effects model, which does not use any explanatory variables.
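For concreteness, a minimal sketch of such a mixed model is given below, assuming a random intercept per program and illustrative column names; the grouping variable and the construction of CV_R here are assumptions for illustration, not the paper’s exact estimator.

```python
# Hedged sketch of a hierarchical mixed model for CCT effect sizes:
# a random intercept per program plus baseline enrollment as a covariate.
import pandas as pd
import statsmodels.formula.api as smf

cct = pd.read_csv("cct_enrollment.csv")  # hypothetical file: one row per result

fit = smf.mixedlm(
    "effect_size ~ baseline_enrollment",
    data=cct,
    groups=cct["program_id"],  # assumed program-level grouping variable
).fit()

# Residual-based heterogeneity measures; CV_R here scales the residual
# standard deviation by the mean effect size (one plausible construction).
resid = cct["effect_size"] - fit.fittedvalues
print("var_R:", resid.var())
print("CV_R:", resid.std() / cct["effect_size"].mean())
```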
Not all papers provide information for each explanatory variable, and each row is based on only those studies which could be used to estimate the model. Thus, the values of var(Y_i), CV(Y_i) and I², which do not depend on the model used, may still vary between rows. In the random effects model, since no explanatory variables are used, Ŷ_i is simply the mean, and var_R(Y_i − Ŷ_i), CV_R(Y_i − Ŷ_i) and I²_R offer no improvement over var(Y_i), CV(Y_i) and I².
As more explanatory variables are added, the gaps between var(Y_i) and var_R(Y_i − Ŷ_i), between CV(Y_i) and CV_R(Y_i − Ŷ_i), and between I² and I²_R grow. In all cases, including explanatory variables helps reduce the unexplained variation, to varying degrees. var_R(Y_i − Ŷ_i) and CV_R(Y_i − Ŷ_i) are greatly reduced, but I²_R is not much lower than I². This is likely due to a feature of I² (and I²_R) discussed previously: it depends on the precision of the estimates. Because evaluations of CCT programs tend to have large sample sizes, I² (and I²_R) is higher than it otherwise would be.
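To make the dependence on precision concrete, here is a minimal sketch of the standard I² computation (Higgins et al., 2003): with the small standard errors typical of large CCT evaluations, Cochran’s Q grows and I² is pushed toward 1 even for modest absolute heterogeneity.

```python
import numpy as np

def i_squared(effects, std_errors):
    """I^2: share of variation in estimates beyond sampling error."""
    y = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(std_errors, dtype=float) ** 2  # inverse-variance weights
    y_bar = np.sum(w * y) / np.sum(w)                   # fixed-effect pooled estimate
    q = np.sum(w * (y - y_bar) ** 2)                    # Cochran's Q
    return max(0.0, (q - (len(y) - 1)) / q)

# The same spread of effects yields a much higher I^2 when standard errors shrink.
print(i_squared([0.05, 0.15, 0.25], [0.10, 0.10, 0.10]))  # 0.0
print(i_squared([0.05, 0.15, 0.25], [0.02, 0.02, 0.02]))  # 0.96
```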
4.2.3 How Quickly Do Results Converge?
As more studies are completed, our ability to make predictions based on the previous
studies’ results might improve.
In the overall data set, results do not appear to converge or diverge over time. Figure 5 provides a scatter plot of the absolute percent difference between a particular result and the mean result for its intervention-outcome against the chronological order of the paper relative to others on the same intervention-outcome, scaled to run from 0 to 1. For example, if there were 5 papers on a particular intervention-outcome combination, the first would take the value 0.2 and the last, 1. In this figure, attention is restricted to those percent differences less than 1000%. There is a weak positive relationship, indicating that earlier results tend to be closer to the mean result than later results, which are more variable, but this is not significant. Further, the relationship varies according to the cutoff used, as Table 17 in Appendix C illustrates.

Figure 5: Variance of Results Over Time, Within Intervention-Outcome
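The relative chronological ordering used in Figure 5 is straightforward to construct; the sketch below is one way to do it with pandas, assuming one row per result with year and intervention-outcome identifiers (names illustrative).

```python
import pandas as pd

results = pd.read_csv("all_results.csv")  # hypothetical file

# Rank papers by year within each intervention-outcome and scale to (0, 1]:
# the first of five papers takes 0.2, the last takes 1.
grp = results.groupby("intervention_outcome")["year"]
results["relative_order"] = grp.rank(method="first") / grp.transform("size")
```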
However, it is still possible that if we can fit a model of the effect sizes to the data, as
we did in the case of CCTs, the fit of the model could improve over time as more data are
added.
To test this, I run the previous OLS regressions of effect size on a constant and baseline enrollment rates using the data available at time period t and measure the absolute error of the predictions Ŷ_i generated by applying the estimated coefficients to the data from future time periods. I consider the prediction error at time period t+1 and, separately, the mean absolute prediction error across all future time periods (t+1, t+2, ...) in alternative specifications.
Results regressing the error on the number of papers used to generate the coefficients are
shown in Table 11. Since multiple papers may have come out in the same year, there are
necessarily discrete jumps in the number of results available at different time periods t, and
results are bootstrapped.
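The expanding-window logic of this exercise can be sketched as follows; the file and column names are illustrative, and the bootstrap step is omitted for brevity.

```python
# Hedged sketch of the convergence exercise: estimate the model on the
# evidence through period t, then record the absolute error of its
# predictions for the next period.
import pandas as pd
import statsmodels.formula.api as smf

cct = pd.read_csv("cct_enrollment.csv")  # hypothetical file
years = sorted(cct["year"].unique())

rows = []
for t, t_next in zip(years, years[1:]):
    train = cct[cct["year"] <= t]
    test = cct[cct["year"] == t_next]
    if len(train) < 3:
        continue  # need a few papers before fitting
    fit = smf.ols("effect_size ~ baseline_enrollment", data=train).fit()
    abs_err = (test["effect_size"] - fit.predict(test)).abs().mean()
    rows.append({"n_previous_papers": len(train), "abs_error": abs_err})

print(pd.DataFrame(rows))  # the error can then be regressed on n_previous_papers
```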
Overall, it appears that the fit can be improved over time. The fit of model 2, in
particular, improves over the first 30-60 studies and afterwards does not show much further
reduction in error, though the fit of other models could take longer to converge. It is possible
that leveraging within-paper heterogeneity could speed convergence. The next section will
explore the relationship between within-study heterogeneity and across-study heterogeneity.
Table 11: Prediction Error from Mixed Models Declines As Evidence Accumulates

                                   Model 1      Model 1         Model 2      Model 2
                                   Absolute     Mean Absolute   Absolute     Mean Absolute
                                   Error        Error           Error        Error
Number of Previous Papers (10s)     0.003       -0.001          -0.014***    -0.043***
                                   (0.00)       (0.00)          (0.00)       (0.01)
Constant                            0.042**      0.057***        0.120***     0.257***
                                   (0.02)       (0.00)          (0.02)       (0.03)
Observations                        135          150             111          150
R²                                  0.01         0.08            0.08         0.42

Columns (1) and (3) focus on the absolute prediction error at time period t+1 given the evidence at time t. Columns (2) and (4) focus on the mean absolute prediction error for all time periods t+1, t+2, ....
4.3 Predicting External Validity from a Single Paper
It would be very helpful if we could estimate the across-paper within-intervention-
outcome metrics using the results from individual papers. Many papers report results for
different subgroups or over time, and the variation in results for a particular intervention-
outcome within a single paper could be a plausible proxy for variation in results for
that same intervention-outcome across papers. If this relationship holds, it would help
researchers estimate the external validity of their own study, even when no other studies
on the intervention have been completed. Table 12 shows the results of regressing the
across-paper measures of var(Y_i), CV(Y_i) and I² on the average within-paper measures for the
same intervention-outcome combination.
Table 12: Regression of Across-Paper Heterogeneity on Mean Within-Paper Heterogeneity

Standard errors in parentheses.

                                     (1)                     (2)                (3)
                              Across-paper variance    Across-paper CV    Across-paper I²
Mean within-paper variance         0.343**
                                   (0.13)
Mean within-paper CV                                        0.000*
                                                            (0.00)
Mean within-paper I²                                                           0.543***
                                                                               (0.10)
Constant                           0.101*                   0.867              0.453***
                                   (0.06)                   (0.63)             (0.08)
Observations                       51                       50                 51
R²                                 0.04                     0.00               0.31
The mean of each within-paper measure is created by calculating the measure within a paper, for each
paper reporting two or more results on the same intervention-outcome combination, and then averaging
that measure across papers within the intervention-outcome.
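The construction in this note is mechanical enough to sketch in code; the version below assumes one row per result with paper and intervention-outcome identifiers (all names illustrative), and is one plausible implementation rather than the paper’s exact procedure.

```python
import pandas as pd
import statsmodels.formula.api as smf

results = pd.read_csv("all_results.csv")  # hypothetical: one row per result

# Mean within-paper variance: variance across a paper's own results,
# for papers with 2+ results, averaged within each intervention-outcome.
within = (
    results.groupby(["intervention_outcome", "paper_id"])["effect_size"]
    .agg(["var", "count"])
    .query("count >= 2")
    .groupby("intervention_outcome")["var"]
    .mean()
    .rename("mean_within_var")
)

# Across-paper variance of each paper's aggregated result.
across = (
    results.groupby(["intervention_outcome", "paper_id"])["effect_size"]
    .mean()
    .groupby("intervention_outcome")
    .var()
    .rename("across_var")
)

data = pd.concat([within, across], axis=1).dropna()
print(smf.ols("across_var ~ mean_within_var", data=data).fit().summary())
```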
It appears that within-paper variation in results is indeed significantly correlated with
across-paper variation in results. Authors could undoubtedly obtain even better estimates
using micro data.
4.4 Robustness Checks
One may be concerned that low-quality papers are either inflating or depressing the
degree of generalizability that is observed. There are infinitely many ways to measure paper
“quality”; I consider two. First, I use the most widely-used quality assessment measure, the
Jadad scale (Jadad et al., 1996). The Jadad scale asks whether the study was randomized,
double-blind, and whether there was a description of withdrawals and dropouts. A paper
gets one point for having each of these characteristics; in addition, a point is added if the method of randomization was appropriate and subtracted if it was inappropriate, and similarly a point is added if the blinding method was appropriate and subtracted if it was inappropriate.
This results in a 0-5 point scale. Given that the kinds of interventions being tested are not
typically readily suited to blinding, I consider all those papers scoring at least a 3 to be
“high quality”.
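A literal coding of this rule is straightforward; the sketch below is one hedged implementation, where the field names are illustrative and a value of None means the method was not described (so no point is added or subtracted).

```python
def jadad_score(study: dict) -> int:
    """Jadad scale: 0-5 points, as described above. Field names illustrative."""
    score = 0
    for item in ("randomized", "double_blind", "withdrawals_described"):
        score += int(bool(study.get(item)))
    # +1 if the method was appropriate, -1 if inappropriate, 0 if not described.
    for method in ("randomization_appropriate", "blinding_appropriate"):
        if study.get(method) is True:
            score += 1
        elif study.get(method) is False:
            score -= 1
    return max(0, min(5, score))

# An unblinded but well-randomized trial reporting attrition scores 3,
# meeting the "high quality" cutoff used here.
study = {"randomized": True, "randomization_appropriate": True,
         "withdrawals_described": True}
print(jadad_score(study) >= 3)  # True
```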
In an alternative specification, I also consider only those results from studies that were RCTs. This is for two reasons. First, many would consider RCTs to be higher-quality studies. Second, we might be concerned about how specification searching and publication bias could affect results. In a separate paper (Vivalt, 2015a), I discuss these issues at length and find relatively little evidence of these biases in the data, with RCTs exhibiting even fewer signs of specification searching and publication bias. The results based on only those studies which were RCTs thus provide a good robustness check.
Tables 15 and 16 in the Appendix provide robustness checks using these two quality
measures. Table 14 also includes the one observation previously dropped for having an effect
size more than 2 SD away from 0. The heterogeneity measures are not substantially different
using these data sets.
5 Conclusion
How much impact evaluation results generalize to other settings is an important topic,
and data from meta-analyses are the ideal data with which to answer this question. With
data on 20 different types of interventions, all collected in the same way, we can begin to
speak a bit more generally about how results tend to vary across contexts and what that
implies for impact evaluation design and policy recommendations.
I started by discussing heterogeneous treatment effects, defining generalizability, and
relating generalizability to several possible measures. Each measure has its strengths and
limitations, and to get a more complete view multiple measures should be used. I then
discussed the rich data set the results are based on and its formation. I presented results
for each measure, first looking at the basic measures of variation and proportion of variation
that is systematic across intervention-outcome combinations and then looking within the
case of a particular intervention-outcome: the effect of CCTs on enrollment rates.
Smaller studies tended to have larger effect sizes, which we might expect if the smaller
studies are better-targeted, are selected to be evaluated when there is a higher a priori ex-
pectation they will have a large effect size, or if there is a preference to report larger effect
sizes, which smaller studies would obtain more often by chance. Government-implemented
programs also had smaller effect sizes than academic/NGO-implemented programs, even
after controlling for sample size. This is unfortunate given we often do smaller impact eval-
uations with NGOs in the hopes of finding a strong positive effect that can scale through
government implementation.
In the case of the effect of CCTs on enrollment rates, the generalizability measures im-
prove with the addition of an explanatory mixed model. I also found that the predictive
ability of the model improved over time, estimating the model using sequentially larger cuts
of the data (i.e. the evidence base at time t, t+1, ...).
Finally, I compared within-paper heterogeneity in treatment effects to across-paper het-
erogeneity in treatment effects. Within-paper heterogeneity is present in my data as papers
often report multiple results for the same outcomes, such as for different subgroups. Fortu-
nately, I find that even these crude measures of within-paper heterogeneity predict across-
paper heterogeneity for the relevant intervention-outcome. This implies that researchers can get a quick estimate of how well their results would apply to other settings simply by using their own data. With access to micro data, authors could do much richer analysis.
Finally, I considered the robustness of these results to specification searching, publication
bias (Vivalt, 2015a), and issues of paper quality. A companion paper finds RCTs fare better
than non-RCTs with respect to specification searching and publication bias, so I present
results based on those studies which are RCTs, as well as separately restricting attention to
those studies that meet a common quality standard.
I consider several ways to evaluate the magnitude of the variation in results. Whether
results are too heterogeneous ultimately depends on the purpose for which they are being
used; some policy decisions might have greater room for error than others. However, it is safe
to say, looking at both the coefficient of variation and the I², which have commonly accepted
benchmarks in other disciplines, that these impact evaluations exhibit more heterogeneity
than is typical in other fields such as medicine, even after accounting for explanatory vari-
ables in the case of conditional cash transfers. Further, I find that under mild assumptions,
the typical variance of results is such that a particular program would be mis-predicted by
more than 50% over 80% of the time.
There are some steps that researchers can take that may improve the generalizability
of their own studies. First, just as with heterogeneous selection into treatment (Chassang, Padró i Miquel and Snowberg, 2012), one solution would be to ensure one’s impact evaluation varied some of the contextual variables that we might think underlie the heterogeneous
treatment effects. Given that many studies are underpowered as it is, that may not be
likely; however, large organizations and governments have been supporting more impact
evaluations, providing more opportunities to explicitly integrate these analyses. Efforts to
coordinate across different studies, asking the same questions or looking at some of the same
outcome variables, would also help. The framing of heterogeneous treatment effects could
also provide positive motivation for replication projects in different contexts: different find-
ings would not necessarily negate the earlier ones but add another level of information.
In summary, generalizability is not binary but something that we can measure. This paper showed that past results have significant but limited ability to predict other results on the same topic, and that this did not appear to be due to bias. Knowing how well results tend to extrapolate, and when, is critical if we are to know how to interpret an impact evaluation’s
results or apply its findings. Given that other fields, with less heterogeneity, also have a more well-developed practice of replication and meta-analysis, economics would seem to have a lot to gain by expanding in this direction.
References
AidGrade (2013). “AidGrade Process Description”, http://www.aidgrade.org/methodology/processmap-and-methodology, March 9, 2013.
AidGrade (2015). “AidGrade Impact Evaluation Data, Version 1.2”.
Alesina, Alberto and David Dollar (2000). “Who Gives Foreign Aid to Whom and Why?”,
Journal of Economic Growth, vol. 5 (1).
Allcott, Hunt (forthcoming). “Site Selection Bias in Program Evaluation”,
Quarterly Journal of Economics.
Bastardi, Anthony, Eric Luis Uhlmann and Lee Ross (2011). “Wishful Thinking: Belief,
Desire, and the Motivated Evaluation of Scientific Evidence”, Psychological Science.
Becker, Betsy Jane and Meng-Jia Wu (2007). “The Synthesis of Regression Slopes in
Meta-Analysis”, Statistical Science, vol. 22 (3).
Bold, Tessa et al. (2013). “Scaling-up What Works: Experimental Evidence on External
Validity in Kenyan Education”, working paper.
Borenstein, Michael et al. (2009). Introduction to Meta-Analysis. Wiley Publishers.
Boriah, Shyam et al. (2008). “Similarity Measures for Categorical Data: A Comparative
Evaluation”, in Proceedings of the Eighth SIAM International Conference on Data Mining.
Brodeur, Abel et al. (2012). “Star Wars: The Empirics Strike Back”, working paper.
Cartwright, Nancy (2007). Hunting Causes and Using Them: Approaches in Philosophy
and Economics. Cambridge: Cambridge University Press.
Cartwright, Nancy (2010). “What Are Randomized Controlled Trials Good For?”,
Philosophical Studies, vol. 147 (1): 59-70.
Casey, Katherine, Rachel Glennerster, and Edward Miguel (2012). “Reshaping Institutions:
Evidence on Aid Impacts Using a Preanalysis Plan.” Quarterly Journal of Economics, vol.
127 (4): 1755-1812.
Chassang, Sylvain, Gerard Padró i Miquel, and Erik Snowberg (2012). “Selective Trials: A Principal-Agent Approach to Randomized Controlled Experiments”, American Economic Review, vol. 102 (4): 1279-1309.
Cohen, Jacob (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Deaton, Angus (2010). “Instruments, Randomization, and Learning about Development”, Journal of Economic Literature, vol. 48 (2): 424-55.
Duflo, Esther, Pascaline Dupas and Michael Kremer (2012). “School Governance, Teacher
Incentives and Pupil-Teacher Ratios: Experimental Evidence from Kenyan Primary
Schools”, NBER Working Paper.
Evans, David and Anna Popova (2014). “Cost-effectiveness Measurement in Development: Accounting for Local Costs and Noisy Impacts”, World Bank Policy Research Working Paper, No. 7027.
Ferguson, Christopher and Michael Brannick (2012). “Publication bias in psychological
science: Prevalence, methods for identifying and controlling, and implications for the use of
meta-analyses.” Psychological Methods, vol. 17 (1), Mar 2012, 120-128.
Franco, Annie, Neil Malhotra and Gabor Simonovits (2014). “Publication Bias in the Social
Sciences: Unlocking the File Drawer”, Working Paper.
Gerber, Alan and Neil Malhotra (2008a). “Do Statistical Reporting Standards Affect
What Is Published? Publication Bias in Two Leading Political Science Journals”,
Quarterly Journal of Political Science, vol 3.
Gerber, Alan and Neil Malhotra (2008b). “Publication Bias in Empirical Sociological Research: Do Arbitrary Significance Levels Distort Published Results?”, Sociological Methods & Research, vol. 37 (3).
Gelman, Andrew et al. (2013). Bayesian Data Analysis, Third Edition, Chapman and
Hall/CRC.
Hedges, Larry and Therese Pigott (2004). “The Power of Statistical Tests for Moderators
in Meta-Analysis”, Psychological Methods, vol. 9 (4).
Higgins, Julian PT and Sally Green, (eds.) (2011). Cochrane Handbook for Systematic Reviews of Interventions, Version 5.1.0 [updated March 2011]. The Cochrane Collaboration. Available from www.cochrane-handbook.org.
Higgins, Julian PT et al. (2003). “Measuring inconsistency in meta-analyses”, BMJ 327:
557-60.
Higgins, Julian PT and Simon Thompson (2002). “Quantifying heterogeneity in a meta-analysis”, Statistics in Medicine, vol. 21: 1539-1558.
Hsiang, Solomon, Marshall Burke and Edward Miguel (2013). “Quantifying the Influence
of Climate on Human Conflict”, Science, vol. 341.
Independent Evaluation Group (2012). “World Bank Group Impact Evaluations: Relevance
and Effectiveness”, World Bank Group.
Jadad, A.R. et al. (1996). “Assessing the quality of reports of randomized clinical trials: Is blinding necessary?”, Controlled Clinical Trials, vol. 17 (1): 1-12.
Millennium Challenge Corporation (2009). “Key Elements of Evaluation at MCC”,
presentation June 9, 2009.
Ng, CK (2014). “Inference on the common coefficient of variation when populations are lognormal: A simulation-based approach”, Journal of Statistics: Advances in Theory and Applications, vol. 11 (2).
Page, Matthew, McKenzie, Joanne and Andrew Forbes (2013). “Many Scenarios Exist
for Selective Inclusion and Reporting of Results in Randomized Trials and Systematic
Reviews”, Journal of Clinical Epidemiology, vol. 66 (5).
Pritchett, Lant and Justin Sandefur (2013). “Context Matters for Size: Why External
Validity Claims and Development Practice Don’t Mix”, Center for Global Development
Working Paper 336.
Rodrik, Dani (2009). “The New Development Economics: We Shall Experiment, but How
Shall We Learn?”, in What Works in Development? Thinking Big, and Thinking Small, ed.
Jessica Cohen and William Easterly, 24-47. Washington, D.C.: Brookings Institution Press.
Saavedra, Juan and Sandra Garcia (2013). “Educational Impacts and Cost-Effectiveness
of Conditional Cash Transfer Programs in Developing Countries: A Meta-Analysis”,
CESR Working Paper.
Simmons, Joseph and Uri Simonsohn (2011). “False-Positive Psychology: Undisclosed
Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant”,
Psychological Science, vol. 22.
Simonsohn, Uri et al. (2014). “P-Curve: A Key to the File Drawer”,
Journal of Experimental Psychology: General.
Tian, Lili (2005). “Inferences on the common coefficient of variation”, Statistics in Medicine,
vol. 24: 2213-2220.
Tibshirani, Ryan and Robert Tibshirani (2009). “A Bias Correction for the Minimum Error
Rate in Cross-Validation”, Annals of Applied Statistics, vol. 3 (2).
Tierney, Michael J. et al. (2011). “More Dollars than Sense: Refining Our Knowledge of
Development Finance Using AidData”, World Development, vol. 39.
Tipton, Elizabeth (2013). “Improving generalizations from experiments using propensity score subclassification: Assumptions, properties, and contexts”, Journal of Educational and Behavioral Statistics, vol. 38: 239-266.
RePEc (2013). “RePEc h-index for journals”, http://ideas.repec.org/top/top.journals.hindex.html.
Vivalt, Eva (2015a). “The Trajectory of Specification Searching Across Disciplines and
Methods”, Working Paper.
Vivalt, Eva (2015b). “How Concerned Should We Be About Selection Bias, Hawthorne
Effects and Retrospective Evaluations?”, Working Paper.
Walsh, Michael et al. (2013). “The Statistical Significance of Randomized Controlled Trial Results is Frequently Fragile: A Case for a Fragility Index”, Journal of Clinical Epidemiology.
USAID (2011). “Evaluation: Learning from Experience”, USAID Evaluation Policy, Washington, DC.
For Online Publication
Appendices
A Guide to Appendices
A.1 Appendices in this Paper
B) Excerpt from AidGrade’s Process Description (2013).
C) Additional results.
D) Derivation of mixed model.
A.2 Further Online Appendices
Having to describe data from twenty different meta-analyses and systematic reviews, I must rely in part on online appendices. The following are available at http://www.evavivalt.com/research:
E) The search terms and inclusion criteria for each topic.
F) The references for each topic.
G) The coding manual.
B Data Collection
B.1 Description of AidGrade’s Methodology
The following details of AidGrade’s data collection process draw heavily from AidGrade’s
Process Description (AidGrade, 2013).
Figure 6: Process Description
Stage 1: Topic Identification
AidGrade staff members were asked to each independently make a list of at least thirty international development programs that they considered to be the most interesting. The independent lists were appended into one document and duplicates were tagged and removed. Each of the remaining topics was discussed and refined to bring them all to a clear and narrow level of focus. Pilot searches were conducted to get a sense of how many impact evaluations there might be on each topic, and all the interventions for which the very basic pilot searches identified at least two impact evaluations were shortlisted. A random subset of the topics was then selected, with the most popular topic also added by public vote.
Stage 2: Search
Each search engine has its own peculiarities. In order to ensure all relevant papers
and few irrelevant papers were included, a set of simple searches was conducted on
different potential search engines. First, initial searches were run on AgEcon; British
Library for Development Studies (BLDS); EBSCO; Econlit; Econpapers; Google Scholar;
IDEAS; JOLISPlus; JSTOR; Oxford Scholarship Online; Proquest; PubMed; ScienceDirect;
SciVerse; SpringerLink; Social Science Research Network (SSRN); Wiley Online Library;
and the World Bank eLibrary. The list of potential search engines was compiled broadly
from those listed in other systematic reviews. The purpose of these initial searches was to
obtain information about the scope and usability of the search engines to determine which
ones would be effective tools in identifying impact evaluations on different topics. External
reviews of different search engines were also consulted, such as a Falagas et al. (2008) study
which covered the advantages and differences between the Google Scholar, Scopus, Web of
Science and PubMed search engines.
Second, searches were conducted for impact evaluations of two test topics: deworming
and toilets. EBSCO, IDEAS, Google Scholar, JOLISPlus, JSTOR, Proquest, PubMed,
ScienceDirect, SciVerse, SpringerLink, Wiley Online Library and the World Bank eLibrary
were used for these searches. Nine search strings were tried for deworming and up to 33 for toilets, with modifications as needed for each search engine. For each search, the number of results was recorded, along with how many of the first 10-50 results appeared to be impact evaluations of the topic in question. This gave a better sense of which
search engines and which kinds of search strings would return both comprehensive and
relevant results. A qualitative assessment of the search results was also provided for the
Google Scholar and SciVerse searches.
Finally, the online databases of J-PAL, IPA, CEGA and 3ie were searched. Since these
databases are already narrowly focused on impact evaluations, attention was restricted to
simple keyword searches, checking whether the search engines that were integrated with
each database seemed to pull up relevant results for each topic.
Ultimately, Google Scholar and the online databases of J-PAL, IPA, CEGA and 3ie,
along with EBSCO/PubMed for health-related interventions, were selected for use in the
full searches.
After the interventions of interest were identified, search strings were developed and
tested using each search source. Each search string included methodology-specific stock
keywords that narrowed the search to impact evaluation studies, except for the search
strings for the J-PAL, IPA, CEGA and 3ie searches, as these databases already exclusively
focus on impact evaluations.
Experimentation with keyword combinations in stages 1.4 and 2.1 was helpful in the
development of the search strings. The search strings could take slightly different forms for
different search engines. Search terms were tailored to the search source, and a full list is
included in an appendix.
C# was used to write a script to scrape the results from search engines. The script was programmed to ensure that the Boolean logic of the search string was properly applied within the constraints of each search engine’s capabilities.
Some sources were specialized and could have useful papers that do not turn up in
simple searches. The papers listed on the J-PAL, IPA, CEGA and 3ie websites are a good
example of this. For these sites, it made more sense for the papers to be manually searched
and added to the relevant spreadsheets. After the automated and manual searches were
complete, duplicates were removed by matching on author and title names.
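The de-duplication step can be sketched as a simple key match; the file and column names below are hypothetical.

```python
import pandas as pd

papers = pd.read_csv("search_results.csv")  # hypothetical consolidated citations

# Build a normalized author|title key and keep the first occurrence of each.
papers["dedup_key"] = (
    papers["authors"].str.lower().str.strip()
    + "|"
    + papers["title"].str.lower().str.strip()
)
papers = papers.drop_duplicates(subset="dedup_key").drop(columns="dedup_key")
```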
During the title screening stage, the consolidated list of citations yielded by the scraped
searches was checked for any existing meta-analyses or systematic reviews. Any papers that
these papers included were added to the list. With these references added, duplicates were
again flagged and removed.
Stage 3: Screening
Generic and topic-specific screening criteria were developed. The generic screening crite-
ria are detailed below, as is an example of a set of topic-specific screening criteria.
The screening criteria were very inclusive overall. This is because AidGrade purposely
follows a different approach to most meta-analyses in the hopes that the data collected can
be re-used by researchers who want to focus on a different subset of papers. Their motiva-
tion is that vast resources are typically devoted to a meta-analysis, but if another team of
researchers thinks a different set of papers should be used, they will have scour the literature
and recreate the data from scratch. If the two groups disagree, all the public sees are their
two sets of findings and their reasoning for selecting different papers. AidGrade instead
strives to cover the superset of all impact evaluations one might wish to include along with a
list of their characteristics (e.g. where they were conducted, whether they were randomized
by individual or by cluster, etc.) and to let people set their own filters on the papers or select
individual papers and view the entire space of possible results.
Figure 7: Generic Screening Criteria

Category               Inclusion Criteria                             Exclusion Criteria
Methodologies          Impact evaluations that have counterfactuals   Observational studies, strictly qualitative studies
Publication status     Peer-reviewed or working paper                 N/A
Time period of study   Any                                            N/A
Location/Geography     Any                                            N/A
Quality                Any                                            N/A
Figure 8: Topic-Specific Criteria Example: Formal Banking

Category       Inclusion Criteria                                        Exclusion Criteria
Intervention   Formal banking services, specifically including:          Other formal banking services;
               - Expansion of credit and/or savings                      Microfinance
               - Provision of technological innovations
               - Introduction or expansion of financial education,
                 or other program to increase financial literacy
                 or awareness
Outcomes       - Individual and household income                         N/A
               - Small and micro-business income
               - Household and business assets
               - Household consumption
               - Small and micro-business investment
               - Small, micro-business or agricultural output
               - Measures of poverty
               - Measures of well-being or stress
               - Business ownership
               - Any other outcome covered by multiple papers
Figure 9 illustrates the difference.
For this reason, minimal screening was done during the screening stage. Instead, data were collected broadly and re-screening was allowed at the point of doing the analysis. This is highly beneficial for the purpose of this paper, as it allows us to look at the largest possible set of papers and all subsets.
After screening criteria were developed, two volunteers independently screened the titles
to determine which papers in the spreadsheet were likely to meet the screening criteria
developed in Stage 3.1. Any differences in coding were arbitrated by a third volunteer. All
volunteers received training before beginning, based on the AidGrade Training Manual and
a test set of entries. Volunteers’ training inputs were screened to ensure that only proficient volunteers would be allowed to continue.

Figure 9: AidGrade’s Strategy

Of those papers that passed the title screening,
two volunteers independently determined whether the papers in the spreadsheet met the
screening criteria developed in Stage 3.1 judging by the paper abstracts. Any differences in
coding were again arbitrated by a third volunteer. The full text was then found for those
papers which passed both the title and abstract checks. Any paper that proved not to
be a relevant impact evaluation using the aforementioned criteria was discarded at this stage.
Stage 4: Coding
Two AidGrade members each independently used the data extraction form developed
in Stage 4.1 to extract data from the papers that passed the screening in Stage 3. Any
disputes were arbitrated by a third AidGrade member. These AidGrade members received
much more training than those who screened the papers, reflecting the increased difficulty
of their work, and also did a test set of entries before being allowed to proceed. The data
extraction form was organized into three sections: (1) general identifying information; (2)
paper and study characteristics; and (3) results. Each section contained qualitative and
quantitative variables that captured the characteristics and results of the study.
Stage 5: Analysis
A researcher was assigned to each meta-analysis topic who could specialize in determin-
ing which of the interventions and results were similar enough to be combined. If in doubt,
researchers could consult the original papers. In general, researchers were encouraged to
focus on all the outcome variables for which multiple papers had results.
When a study had multiple treatment arms sharing the same control, researchers would
check whether enough data was provided in the original paper to allow estimates to be
combined before the meta-analysis was run. This is a best practice to avoid double-counting
the control group; for details, see the Cochrane Handbook for Systematic Reviews of
Interventions (2011). If a paper did not provide sufficient data for this, the researcher would
make the decision as to which treatment arm to focus on. Data were then standardized
within each topic to be more comparable before analysis (for example, units were converted).
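For reference, the Cochrane Handbook’s formula for collapsing two treatment arms into a single group, so that the shared control is counted only once, can be sketched as follows (the numbers in the usage line are made up):

```python
import math

def combine_arms(n1, m1, sd1, n2, m2, sd2):
    """Combine two arms (sizes, means, SDs) per the Cochrane Handbook (2011)."""
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n  # sample-size-weighted mean
    pooled_ss = (
        (n1 - 1) * sd1**2
        + (n2 - 1) * sd2**2
        + (n1 * n2 / n) * (m1 - m2) ** 2  # between-arm component
    )
    return n, m, math.sqrt(pooled_ss / (n - 1))

print(combine_arms(100, 0.30, 1.0, 120, 0.10, 0.9))  # illustrative values
```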
The subsequent steps of the meta-analysis process are irrelevant for the purposes of
this paper. It should be noted that the first set of ten topics followed a slightly different
procedure for stages (1) and (2). Only one list of potential topics was created in Stage
1.1, so Stage 1.2 (Consolidation of Lists) was only vacuously followed. There was also no
randomization after public voting (Stage 1.7) and no scripted scraping searches (Stage 2.3),
as all searches were manually conducted using specific strings. A different search engine was
also used: SciVerse Hub, an aggregator that includes SciVerse Scopus, MEDLINE, PubMed
Central, ArXiv.org, and many other databases of articles, books and presentations. The
search strings for both rounds of meta-analysis, manual and scripted, are detailed in another
appendix.
C Additional Results
Table 13: Descriptive Statistics: Standardized Narrowly Defined Outcomes
Intervention Outcome # Neg sig papers # Insig papers # Pos sig papers # Papers
Conditional cash transfers Attendance rate 0 6 9 15
Conditional cash transfers Enrollment rate 0 6 31 37
Conditional cash transfers Height 0 1 1 2
Conditional cash transfers Height-for-age 0 6 1 7
Conditional cash transfers Labor force participation 1 12 5 18
Conditional cash transfers Probability unpaid work 1 0 4 5
Conditional cash transfers Test scores 1 2 2 5
Conditional cash transfers Unpaid labor 0 2 3 5
Conditional cash transfers Weight-for-age 0 2 0 2
Conditional cash transfers Weight-for-height 0 1 1 2
HIV/AIDS Education Pregnancy rate 0 2 0 2
HIV/AIDS Education Probability has multiple sex partners 0 1 1 2
HIV/AIDS Education Used contraceptives 1 6 3 10
Unconditional cash transfers Enrollment rate 0 3 8 11
Unconditional cash transfers Test scores 0 1 1 2
Unconditional cash transfers Weight-for-height 0 2 0 2
Insecticide-treated bed nets Malaria 0 3 6 9
Contract teachers Test scores 0 1 2 3
Deworming Attendance rate 0 1 1 2
Deworming Birthweight 0 2 0 2
Deworming Diarrhea incidence 0 1 1 2
Deworming Height 3 10 4 17
Deworming Height-for-age 1 9 4 14
Deworming Hemoglobin 0 13 2 15
Deworming Malformations 0 2 0 2
Deworming Mid-upper arm circumference 2 0 5 7
Deworming Test scores 0 0 2 2
Deworming Weight 3 8 7 18
Deworming Weight-for-age 1 6 5 12
Deworming Weight-for-height 2 7 2 11
Financial literacy Savings 0 2 3 5
Improved stoves Chest pain 0 0 2 2
Improved stoves Cough 0 0 2 2
Improved stoves Difficulty breathing 0 0 2 2
Improved stoves Excessive nasal secretion 0 1 1 2
Irrigation Consumption 0 1 1 2
Irrigation Total income 0 1 1 2
Microfinance Assets 0 3 1 4
Microfinance Consumption 0 2 0 2
Microfinance Profits 1 3 1 5
Microfinance Savings 0 3 0 3
Microfinance Total income 0 3 2 5
Micro health insurance Enrollment rate 0 1 1 2
Micronutrient supplementation Birthweight 0 4 3 7
Micronutrient supplementation Body mass index 0 1 4 5
Micronutrient supplementation Cough prevalence 0 3 0 3
Micronutrient supplementation Diarrhea incidence 1 5 5 11
Micronutrient supplementation Diarrhea prevalence 0 5 1 6
Micronutrient supplementation Fever incidence 0 2 0 2
Micronutrient supplementation Fever prevalence 1 2 2 5
Micronutrient supplementation Height 3 22 7 32
Micronutrient supplementation Height-for-age 5 23 8 36
Micronutrient supplementation Hemoglobin 7 11 29 47
Micronutrient supplementation Malaria 0 2 0 2
Micronutrient supplementation Mid-upper arm circumference 2 9 7 18
Micronutrient supplementation Mortality rate 0 12 0 12
Micronutrient supplementation Perinatal deaths 1 5 0 6
Micronutrient supplementation Prevalence of anemia 0 6 9 15
Micronutrient supplementation Stillbirths 0 4 0 4
Micronutrient supplementation Stunted 0 5 0 5
Micronutrient supplementation Test scores 1 2 7 10
Micronutrient supplementation Triceps skinfold measurement 1 0 1 2
Micronutrient supplementation Wasted 0 2 0 2
Micronutrient supplementation Weight 4 19 13 36
Micronutrient supplementation Weight-for-age 1 23 10 34
Micronutrient supplementation Weight-for-height 0 18 8 26
Mobile phone-based reminders Appointment attendance rate 1 0 2 3
Mobile phone-based reminders Treatment adherence 1 3 1 5
Performance pay Test scores 0 2 1 3
Rural electrification Enrollment rate 0 1 2 3
Rural electrification Study time 0 1 2 3
Rural electrification Total income 0 2 0 2
Safe water storage Diarrhea incidence 0 1 1 2
Scholarships Attendance rate 0 1 1 2
Scholarships Enrollment rate 0 2 3 5
Scholarships Test scores 0 2 0 2
School meals Enrollment rate 0 3 0 3
School meals Height-for-age 0 2 0 2
School meals Test scores 0 2 1 3
Water treatment Diarrhea incidence 0 1 1 2
Water treatment Diarrhea prevalence 0 1 5 6
Women’s empowerment programs Savings 0 1 1 2
Women’s empowerment programs Total income 0 0 2 2
Average 0.6 4.2 3.2 7.9
Table 14: Heterogeneity Measures for Effect Sizes Within Intervention-Outcomes, Including Outlier
Intervention Outcome var(Y_i) CV(Y_i) I²
Microfinance Assets 0.000 5.508 0.999
Rural Electrification Enrollment rate 0.001 0.129 0.993
Micronutrients Cough prevalence 0.001 1.648 0.829
Microfinance Total income 0.001 0.989 0.998
Microfinance Savings 0.002 1.773 0.922
Financial Literacy Savings 0.004 5.472 0.979
Microfinance Profits 0.005 5.448 0.519
Contract Teachers Test scores 0.005 0.403 0.998
Performance Pay Test scores 0.006 0.608 0.552
Micronutrients Body mass index 0.007 0.675 1.000
Conditional Cash Transfers Unpaid labor 0.009 0.918 0.836
Micronutrients Weight-for-age 0.009 1.941 0.663
Micronutrients Weight-for-height 0.010 2.148 0.416
Micronutrients Birthweight 0.010 0.981 0.997
Micronutrients Height-for-age 0.012 2.467 0.640
Conditional Cash Transfers Test scores 0.013 1.866 0.887
Deworming Hemoglobin 0.015 3.377 0.996
Micronutrients Mid-upper arm circumference 0.015 2.078 0.317
SMS Reminders Treatment adherence 0.022 1.672 0.050
Micronutrients Height 0.023 4.369 0.991
Micronutrients Mortality rate 0.025 2.880 0.698
Micronutrients Stunted 0.025 1.110 0.665
Bed Nets Malaria 0.029 0.497 1.000
Conditional Cash Transfers Attendance rate 0.030 0.523 0.362
Micronutrients Weight 0.034 2.705 0.708
HIV/AIDS Education Used contraceptives 0.037 3.044 0.867
Micronutrients Perinatal deaths 0.038 2.096 0.108
Deworming Height 0.049 2.310 0.995
Micronutrients Test scores 0.052 1.694 0.891
Conditional Cash Transfers Height-for-age 0.055 22.166 0.125
Conditional Cash Transfers Enrollment rate 0.056 1.287 1.000
Deworming Weight-for-height 0.072 3.129 0.910
Micronutrients Stillbirths 0.075 3.041 0.955
Micronutrients Prevalence of anemia 0.095 0.793 0.268
Deworming Height-for-age 0.098 1.978 0.944
Deworming Weight-for-age 0.107 2.287 0.993
Micronutrients Diarrhea incidence 0.109 3.300 0.663
Micronutrients Diarrhea prevalence 0.111 1.205 0.815
Micronutrients Fever prevalence 0.146 3.076 0.959
Deworming Weight 0.165 3.897 0.999
Micronutrients Hemoglobin 0.215 1.439 0.269
SMS Reminders Appointment attendance rate 0.224 2.908 0.913
Deworming Mid-upper arm circumference 0.439 1.773 1.000
Conditional Cash Transfers Probability unpaid work 0.609 6.415 1.000
Conditional Cash Transfers Labor force participation 0.789 2.972 0.461
1-s2.0-S0149718916300787-main
 
EJ1241940.pdf
EJ1241940.pdfEJ1241940.pdf
EJ1241940.pdf
 
The Politics of Aid Effectiveness: Why Better Tools can Make for Worse Outcomes
The Politics of Aid Effectiveness: Why Better Tools can Make for Worse OutcomesThe Politics of Aid Effectiveness: Why Better Tools can Make for Worse Outcomes
The Politics of Aid Effectiveness: Why Better Tools can Make for Worse Outcomes
 
SOCW 6311 wk 11 discussion 1 peer responses Respond to a.docx
SOCW 6311 wk 11 discussion 1 peer responses Respond to a.docxSOCW 6311 wk 11 discussion 1 peer responses Respond to a.docx
SOCW 6311 wk 11 discussion 1 peer responses Respond to a.docx
 
Can systematic reviews help identify what works and why?
Can systematic reviews help identify what works and why?Can systematic reviews help identify what works and why?
Can systematic reviews help identify what works and why?
 
Keynote 1. How can you tell if is not working? Evaluating the impact of educa...
Keynote 1. How can you tell if is not working? Evaluating the impact of educa...Keynote 1. How can you tell if is not working? Evaluating the impact of educa...
Keynote 1. How can you tell if is not working? Evaluating the impact of educa...
 
SWK 421 Research & Statistical Methods in Social WorkResearch.docx
SWK 421 Research & Statistical Methods in Social WorkResearch.docxSWK 421 Research & Statistical Methods in Social WorkResearch.docx
SWK 421 Research & Statistical Methods in Social WorkResearch.docx
 
Running head LOGIC MODELLOGIC MODEL 2Logic modelStu.docx
Running head LOGIC MODELLOGIC MODEL 2Logic modelStu.docxRunning head LOGIC MODELLOGIC MODEL 2Logic modelStu.docx
Running head LOGIC MODELLOGIC MODEL 2Logic modelStu.docx
 
IssaPopulation and SamplingThe constructs of population and sa.docx
IssaPopulation and SamplingThe constructs of population and sa.docxIssaPopulation and SamplingThe constructs of population and sa.docx
IssaPopulation and SamplingThe constructs of population and sa.docx
 
Adding New Dimensions To Case Study Evaluations The Case Of Evaluating Compr...
Adding New Dimensions To Case Study Evaluations  The Case Of Evaluating Compr...Adding New Dimensions To Case Study Evaluations  The Case Of Evaluating Compr...
Adding New Dimensions To Case Study Evaluations The Case Of Evaluating Compr...
 
The field of program evaluation presents a diversity of images a.docx
The field of program evaluation presents a diversity of images a.docxThe field of program evaluation presents a diversity of images a.docx
The field of program evaluation presents a diversity of images a.docx
 
Prediciting happiness from mobile app survey data
Prediciting happiness from mobile app survey dataPrediciting happiness from mobile app survey data
Prediciting happiness from mobile app survey data
 
How To Write A Essay Proposal
How To Write A Essay ProposalHow To Write A Essay Proposal
How To Write A Essay Proposal
 
Aresty poster
Aresty posterAresty poster
Aresty poster
 

Plus de Stockholm Institute of Transition Economics

The Russia sanctions as a human rights instrument: Violations of export contr...
The Russia sanctions as a human rights instrument: Violations of export contr...The Russia sanctions as a human rights instrument: Violations of export contr...
The Russia sanctions as a human rights instrument: Violations of export contr...Stockholm Institute of Transition Economics
 
Unemployment and Intra-Household Dynamics: the Effect of Male Job Loss on Int...
Unemployment and Intra-Household Dynamics: the Effect of Male Job Loss on Int...Unemployment and Intra-Household Dynamics: the Effect of Male Job Loss on Int...
Unemployment and Intra-Household Dynamics: the Effect of Male Job Loss on Int...Stockholm Institute of Transition Economics
 
Paid Work for Women and Domestic Violence: Evidence from the Rwandan Coffee M...
Paid Work for Women and Domestic Violence: Evidence from the Rwandan Coffee M...Paid Work for Women and Domestic Violence: Evidence from the Rwandan Coffee M...
Paid Work for Women and Domestic Violence: Evidence from the Rwandan Coffee M...Stockholm Institute of Transition Economics
 
Domestic Violence Legislation - Awareness and Support in Latvia, Russia and U...
Domestic Violence Legislation - Awareness and Support in Latvia, Russia and U...Domestic Violence Legislation - Awareness and Support in Latvia, Russia and U...
Domestic Violence Legislation - Awareness and Support in Latvia, Russia and U...Stockholm Institute of Transition Economics
 
Perceptions of violence and their socio-economic determinants: a comparative ...
Perceptions of violence and their socio-economic determinants: acomparative ...Perceptions of violence and their socio-economic determinants: acomparative ...
Perceptions of violence and their socio-economic determinants: a comparative ...Stockholm Institute of Transition Economics
 

Plus de Stockholm Institute of Transition Economics (20)

Tracking sanctions compliance | SITE 2023 Development Day conference
Tracking sanctions compliance | SITE 2023 Development Day conferenceTracking sanctions compliance | SITE 2023 Development Day conference
Tracking sanctions compliance | SITE 2023 Development Day conference
 
War and Trade in Eurasia | SITE 2023 Development Day conference
War and Trade in Eurasia | SITE 2023 Development Day conferenceWar and Trade in Eurasia | SITE 2023 Development Day conference
War and Trade in Eurasia | SITE 2023 Development Day conference
 
Reducing the Russian Economic Capacity and support Ukraine
Reducing the Russian Economic Capacity and support UkraineReducing the Russian Economic Capacity and support Ukraine
Reducing the Russian Economic Capacity and support Ukraine
 
Energy sanctions - What else can be done?
Energy sanctions - What else can be done?Energy sanctions - What else can be done?
Energy sanctions - What else can be done?
 
How should policy be designed during energy-economic warfare?
How should policy be designed during energy-economic warfare?How should policy be designed during energy-economic warfare?
How should policy be designed during energy-economic warfare?
 
The impact of the war on Russia’s fossil fuel earnings
The impact of the war on Russia’s fossil fuel earningsThe impact of the war on Russia’s fossil fuel earnings
The impact of the war on Russia’s fossil fuel earnings
 
The Russia sanctions as a human rights instrument: Violations of export contr...
The Russia sanctions as a human rights instrument: Violations of export contr...The Russia sanctions as a human rights instrument: Violations of export contr...
The Russia sanctions as a human rights instrument: Violations of export contr...
 
SITE 2022 Development Day conference program
SITE 2022 Development Day conference programSITE 2022 Development Day conference program
SITE 2022 Development Day conference program
 
SITE 2022 Development Day conference | Program
SITE 2022 Development Day conference | ProgramSITE 2022 Development Day conference | Program
SITE 2022 Development Day conference | Program
 
Program | SITE 2022 Development Day conference
Program | SITE 2022 Development Day conferenceProgram | SITE 2022 Development Day conference
Program | SITE 2022 Development Day conference
 
Ce^2 Conference 2022 Programme
Ce^2 Conference 2022 ProgrammeCe^2 Conference 2022 Programme
Ce^2 Conference 2022 Programme
 
Ce2 Worksop & Conference 2022 Program
Ce2 Worksop & Conference 2022 ProgramCe2 Worksop & Conference 2022 Program
Ce2 Worksop & Conference 2022 Program
 
Ce^2 Conference 2022 Program
Ce^2 Conference 2022 ProgramCe^2 Conference 2022 Program
Ce^2 Conference 2022 Program
 
(Ce)2 Workshop program (preliminary)
(Ce)2 Workshop program (preliminary)(Ce)2 Workshop program (preliminary)
(Ce)2 Workshop program (preliminary)
 
Unemployment and Intra-Household Dynamics: the Effect of Male Job Loss on Int...
Unemployment and Intra-Household Dynamics: the Effect of Male Job Loss on Int...Unemployment and Intra-Household Dynamics: the Effect of Male Job Loss on Int...
Unemployment and Intra-Household Dynamics: the Effect of Male Job Loss on Int...
 
Football, Alcohol and Domestic Abuse
Football, Alcohol and Domestic AbuseFootball, Alcohol and Domestic Abuse
Football, Alcohol and Domestic Abuse
 
Paid Work for Women and Domestic Violence: Evidence from the Rwandan Coffee M...
Paid Work for Women and Domestic Violence: Evidence from the Rwandan Coffee M...Paid Work for Women and Domestic Violence: Evidence from the Rwandan Coffee M...
Paid Work for Women and Domestic Violence: Evidence from the Rwandan Coffee M...
 
Domestic Violence Legislation - Awareness and Support in Latvia, Russia and U...
Domestic Violence Legislation - Awareness and Support in Latvia, Russia and U...Domestic Violence Legislation - Awareness and Support in Latvia, Russia and U...
Domestic Violence Legislation - Awareness and Support in Latvia, Russia and U...
 
Social Contexts and the Perception of Differential Treatment
Social Contexts and the Perception of Differential TreatmentSocial Contexts and the Perception of Differential Treatment
Social Contexts and the Perception of Differential Treatment
 
Perceptions of violence and their socio-economic determinants: a comparative ...
Perceptions of violence and their socio-economic determinants: acomparative ...Perceptions of violence and their socio-economic determinants: acomparative ...
Perceptions of violence and their socio-economic determinants: a comparative ...
 

Dernier

Uae-NO1 Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
Uae-NO1 Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...Uae-NO1 Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
Uae-NO1 Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...Amil baba
 
Unveiling Poonawalla Fincorp’s Phenomenal Performance Under Abhay Bhutada’s L...
Unveiling Poonawalla Fincorp’s Phenomenal Performance Under Abhay Bhutada’s L...Unveiling Poonawalla Fincorp’s Phenomenal Performance Under Abhay Bhutada’s L...
Unveiling Poonawalla Fincorp’s Phenomenal Performance Under Abhay Bhutada’s L...beulahfernandes8
 
『澳洲文凭』买科廷大学毕业证书成绩单办理澳洲Curtin文凭学位证书
『澳洲文凭』买科廷大学毕业证书成绩单办理澳洲Curtin文凭学位证书『澳洲文凭』买科廷大学毕业证书成绩单办理澳洲Curtin文凭学位证书
『澳洲文凭』买科廷大学毕业证书成绩单办理澳洲Curtin文凭学位证书rnrncn29
 
cost of capital questions financial management
cost of capital questions financial managementcost of capital questions financial management
cost of capital questions financial managementtanmayarora23
 
2024-04-09 - Pension Playpen roundtable - slides.pptx
2024-04-09 - Pension Playpen roundtable - slides.pptx2024-04-09 - Pension Playpen roundtable - slides.pptx
2024-04-09 - Pension Playpen roundtable - slides.pptxHenry Tapper
 
2024 Q1 Crypto Industry Report | CoinGecko
2024 Q1 Crypto Industry Report | CoinGecko2024 Q1 Crypto Industry Report | CoinGecko
2024 Q1 Crypto Industry Report | CoinGeckoCoinGecko
 
Unveiling Business Expansion Trends in 2024
Unveiling Business Expansion Trends in 2024Unveiling Business Expansion Trends in 2024
Unveiling Business Expansion Trends in 2024Champak Jhagmag
 
The AES Investment Code - the go-to counsel for the most well-informed, wise...
The AES Investment Code -  the go-to counsel for the most well-informed, wise...The AES Investment Code -  the go-to counsel for the most well-informed, wise...
The AES Investment Code - the go-to counsel for the most well-informed, wise...AES International
 
Amil Baba In Pakistan amil baba in Lahore amil baba in Islamabad amil baba in...
Amil Baba In Pakistan amil baba in Lahore amil baba in Islamabad amil baba in...Amil Baba In Pakistan amil baba in Lahore amil baba in Islamabad amil baba in...
Amil Baba In Pakistan amil baba in Lahore amil baba in Islamabad amil baba in...amilabibi1
 
Unit 4.1 financial markets operations .pdf
Unit 4.1 financial markets operations .pdfUnit 4.1 financial markets operations .pdf
Unit 4.1 financial markets operations .pdfSatyamSinghParihar2
 
10 QuickBooks Tips 2024 - Globus Finanza.pdf
10 QuickBooks Tips 2024 - Globus Finanza.pdf10 QuickBooks Tips 2024 - Globus Finanza.pdf
10 QuickBooks Tips 2024 - Globus Finanza.pdfglobusfinanza
 
Guard Your Investments- Corporate Defaults Alarm.pdf
Guard Your Investments- Corporate Defaults Alarm.pdfGuard Your Investments- Corporate Defaults Alarm.pdf
Guard Your Investments- Corporate Defaults Alarm.pdfJasper Colin
 
INTERNATIONAL TRADE INSTITUTIONS[6].pptx
INTERNATIONAL TRADE INSTITUTIONS[6].pptxINTERNATIONAL TRADE INSTITUTIONS[6].pptx
INTERNATIONAL TRADE INSTITUTIONS[6].pptxaymenkhalfallah23
 
Market Morning Updates for 16th April 2024
Market Morning Updates for 16th April 2024Market Morning Updates for 16th April 2024
Market Morning Updates for 16th April 2024Devarsh Vakil
 
Gender and caste discrimination in india
Gender and caste discrimination in indiaGender and caste discrimination in india
Gender and caste discrimination in indiavandanasingh01072003
 
Overview of Inkel Unlisted Shares Price.
Overview of Inkel Unlisted Shares Price.Overview of Inkel Unlisted Shares Price.
Overview of Inkel Unlisted Shares Price.Precize Formely Leadoff
 
Role of Information and technology in banking and finance .pptx
Role of Information and technology in banking and finance .pptxRole of Information and technology in banking and finance .pptx
Role of Information and technology in banking and finance .pptxNarayaniTripathi2
 
Economic Risk Factor Update: April 2024 [SlideShare]
Economic Risk Factor Update: April 2024 [SlideShare]Economic Risk Factor Update: April 2024 [SlideShare]
Economic Risk Factor Update: April 2024 [SlideShare]Commonwealth
 
Liquidity Decisions in Financial management
Liquidity Decisions in Financial managementLiquidity Decisions in Financial management
Liquidity Decisions in Financial managementshrutisingh143670
 
Introduction to Health Economics Dr. R. Kurinji Malar.pptx
Introduction to Health Economics Dr. R. Kurinji Malar.pptxIntroduction to Health Economics Dr. R. Kurinji Malar.pptx
Introduction to Health Economics Dr. R. Kurinji Malar.pptxDrRkurinjiMalarkurin
 

Dernier (20)

Uae-NO1 Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
Uae-NO1 Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...Uae-NO1 Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
Uae-NO1 Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
Unveiling Poonawalla Fincorp’s Phenomenal Performance Under Abhay Bhutada’s L...
Unveiling Poonawalla Fincorp’s Phenomenal Performance Under Abhay Bhutada’s L...Unveiling Poonawalla Fincorp’s Phenomenal Performance Under Abhay Bhutada’s L...
Unveiling Poonawalla Fincorp’s Phenomenal Performance Under Abhay Bhutada’s L...
 
『澳洲文凭』买科廷大学毕业证书成绩单办理澳洲Curtin文凭学位证书
『澳洲文凭』买科廷大学毕业证书成绩单办理澳洲Curtin文凭学位证书『澳洲文凭』买科廷大学毕业证书成绩单办理澳洲Curtin文凭学位证书
『澳洲文凭』买科廷大学毕业证书成绩单办理澳洲Curtin文凭学位证书
 
cost of capital questions financial management
cost of capital questions financial managementcost of capital questions financial management
cost of capital questions financial management
 
2024-04-09 - Pension Playpen roundtable - slides.pptx
2024-04-09 - Pension Playpen roundtable - slides.pptx2024-04-09 - Pension Playpen roundtable - slides.pptx
2024-04-09 - Pension Playpen roundtable - slides.pptx
 
2024 Q1 Crypto Industry Report | CoinGecko
2024 Q1 Crypto Industry Report | CoinGecko2024 Q1 Crypto Industry Report | CoinGecko
2024 Q1 Crypto Industry Report | CoinGecko
 
Unveiling Business Expansion Trends in 2024
Unveiling Business Expansion Trends in 2024Unveiling Business Expansion Trends in 2024
Unveiling Business Expansion Trends in 2024
 
The AES Investment Code - the go-to counsel for the most well-informed, wise...
The AES Investment Code -  the go-to counsel for the most well-informed, wise...The AES Investment Code -  the go-to counsel for the most well-informed, wise...
The AES Investment Code - the go-to counsel for the most well-informed, wise...
 
Amil Baba In Pakistan amil baba in Lahore amil baba in Islamabad amil baba in...
Amil Baba In Pakistan amil baba in Lahore amil baba in Islamabad amil baba in...Amil Baba In Pakistan amil baba in Lahore amil baba in Islamabad amil baba in...
Amil Baba In Pakistan amil baba in Lahore amil baba in Islamabad amil baba in...
 
Unit 4.1 financial markets operations .pdf
Unit 4.1 financial markets operations .pdfUnit 4.1 financial markets operations .pdf
Unit 4.1 financial markets operations .pdf
 
10 QuickBooks Tips 2024 - Globus Finanza.pdf
10 QuickBooks Tips 2024 - Globus Finanza.pdf10 QuickBooks Tips 2024 - Globus Finanza.pdf
10 QuickBooks Tips 2024 - Globus Finanza.pdf
 
Guard Your Investments- Corporate Defaults Alarm.pdf
Guard Your Investments- Corporate Defaults Alarm.pdfGuard Your Investments- Corporate Defaults Alarm.pdf
Guard Your Investments- Corporate Defaults Alarm.pdf
 
INTERNATIONAL TRADE INSTITUTIONS[6].pptx
INTERNATIONAL TRADE INSTITUTIONS[6].pptxINTERNATIONAL TRADE INSTITUTIONS[6].pptx
INTERNATIONAL TRADE INSTITUTIONS[6].pptx
 
Market Morning Updates for 16th April 2024
Market Morning Updates for 16th April 2024Market Morning Updates for 16th April 2024
Market Morning Updates for 16th April 2024
 
Gender and caste discrimination in india
Gender and caste discrimination in indiaGender and caste discrimination in india
Gender and caste discrimination in india
 
Overview of Inkel Unlisted Shares Price.
Overview of Inkel Unlisted Shares Price.Overview of Inkel Unlisted Shares Price.
Overview of Inkel Unlisted Shares Price.
 
Role of Information and technology in banking and finance .pptx
Role of Information and technology in banking and finance .pptxRole of Information and technology in banking and finance .pptx
Role of Information and technology in banking and finance .pptx
 
Economic Risk Factor Update: April 2024 [SlideShare]
Economic Risk Factor Update: April 2024 [SlideShare]Economic Risk Factor Update: April 2024 [SlideShare]
Economic Risk Factor Update: April 2024 [SlideShare]
 
Liquidity Decisions in Financial management
Liquidity Decisions in Financial managementLiquidity Decisions in Financial management
Liquidity Decisions in Financial management
 
Introduction to Health Economics Dr. R. Kurinji Malar.pptx
Introduction to Health Economics Dr. R. Kurinji Malar.pptxIntroduction to Health Economics Dr. R. Kurinji Malar.pptx
Introduction to Health Economics Dr. R. Kurinji Malar.pptx
 

How Much Can We Generalize? Measuring the External Validity of Impact Evaluations

  • 1. How Much Can We Generalize? Measuring the External Validity of Impact Evaluations Eva Vivalt∗ New York University August 31, 2015 Abstract Impact evaluations aim to predict the future, but they are rooted in particular contexts and to what extent they generalize is an open and important question. I founded an organization to systematically collect and synthesize impact evalu- ation results on a wide variety of interventions in development. These data allow me to answer this and other questions for the first time using a large data set of studies. I consider several measures of generalizability, discuss the strengths and limitations of each metric, and provide benchmarks based on the data. I use the example of the effect of conditional cash transfers on enrollment rates to show how some of the heterogeneity can be modelled and the effect this can have on the generalizability measures. The predictive power of the model improves over time as more studies are completed. Finally, I show how researchers can estimate the generalizability of their own study using their own data, even when data from no comparable studies exist. ∗ E-mail: eva.vivalt@nyu.edu. I thank Edward Miguel, Bill Easterly, David Card, Ernesto Dal B´o, Hunt Allcott, Elizabeth Tipton, David McKenzie, Vinci Chow, Willa Friedman, Xing Huang, Michaela Pagel, Steven Pennings, Edson Severnini, seminar participants at the University of California, Berkeley, Columbia University, New York University, the World Bank, Cornell University, Princeton University, the University of Toronto, the London School of Economics, the Australian National University, and the University of Ottawa, among others, and participants at the 2015 ASSA meeting and 2013 Association for Public Policy Analysis and Management Fall Research Conference for helpful comments. I am also grateful for the hard work put in by many at AidGrade over the duration of this project, including but not limited to Jeff Qiu, Bobbie Macdonald, Diana Stanescu, Cesar Augusto Lopez, Mi Shen, Ning Zhang, Jennifer Ambrose, Naomi Crowther, Timothy Catlett, Joohee Kim, Gautam Bastian, Christine Shen, Taha Jalil, Risa Santoso and Catherine Razeto. 1
While the main reason to examine generalizability is to aid interpretation and improve predictions, it would also help to direct research attention to where it is most needed. If generalizability were higher in some areas, fewer papers would be needed to understand how people would behave in a similar situation; conversely, if there were topics or regions where generalizability was low, it would call for further study. With more information, researchers can better calibrate where to direct their attentions to generate new insights.

It is well-known that impact evaluations only happen in certain contexts. For example, Figure 1 shows a heat map of the geocoded impact evaluations in the data used in this paper overlaid by the distribution of World Bank projects (black dots). Both sets of data are geographically clustered, and whether or not we can reasonably extrapolate from one to another depends on how much related heterogeneity there is in treatment effects. Allcott (forthcoming) recently showed that site selection bias was an issue for randomized controlled trials (RCTs) on a firm's energy conservation programs. Microfinance institutions that run RCTs and hospitals that conduct clinical trials are also selected (Allcott, forthcoming), and World Bank projects that receive an impact evaluation are different from those that do not (Vivalt, 2015). Others have sought to explain heterogeneous treatment effects in meta-analyses of specific topics (e.g. Saavedra and Garcia, 2013, among many others for conditional cash transfers), or to argue that they are so heterogeneous they cannot be adequately modelled (e.g. Deaton, 2011; Pritchett and Sandefur, 2013).
Figure 1: Growth of Impact Evaluations and Location Relative to Programs. The figure on the left shows a heat map of the impact evaluations in AidGrade's database overlaid by black dots indicating where the World Bank has done projects. While there are many other development programs not done by the World Bank, this figure illustrates the great numbers and geographical dispersion of development programs. The figure on the right plots the number of studies that came out in each year that are contained in each of three databases described in the text: 3ie's title/abstract/keyword database of impact evaluations; J-PAL's database of affiliated randomized controlled trials; and AidGrade's database of impact evaluation results data.

Impact evaluations are still exponentially increasing in number and in terms of the resources devoted to them. The World Bank recently received a major grant from the UK aid agency DFID to expand its already large impact evaluation work; the Millennium Challenge Corporation has committed to conduct rigorous impact evaluations for 50% of its activities, with "some form of credible evaluation of impact" for every activity (Millennium Challenge Corporation, 2009); and the U.S. Agency for International Development is also increasingly invested in impact evaluations, coming out with a new policy in 2011 that directs 3% of program funds to evaluation. [Footnote 1]

Footnote 1: While most of these are less rigorous "performance evaluations", country mission leaders are supposed to identify at least one opportunity for impact evaluation for each development objective in their 3-5 year plans (USAID, 2011).

Yet while impact evaluations are still growing in development, a few thousand are already complete. Figure 1 plots the explosion of RCTs that researchers affiliated with J-PAL, a center for development economics research, have completed each year; alongside are the number of development-related impact evaluations released each year according to 3ie, which keeps a directory of titles, abstracts, and other basic information on impact evaluations more broadly, including quasi-experimental designs; finally, the dashed line shows the number of papers that came out in each year that are included in AidGrade's database of impact evaluation results, which will be described shortly.
In short, while we do impact evaluation to figure out what will happen in the future, many issues have been raised about how well we can extrapolate from past impact evaluations, and despite the importance of the topic, previously we were able to do little more than guess or examine the question in narrow settings, as we did not have the data. Now we have the opportunity to address speculation, drawing on a large, unique dataset of impact evaluation results.

I founded a non-profit organization dedicated to gathering these data. That organization, AidGrade, seeks to systematically understand which programs work best where, a task that requires also knowing the limits of our knowledge. To date, AidGrade has conducted 20 meta-analyses and systematic reviews of different development programs. [Footnote 2] Data gathered through meta-analyses are the ideal data with which to answer the question of how much we can extrapolate from past results, and since data on these 20 topics were collected in the same way, coding the same outcomes and other variables, we can look across different types of programs to see if there are any more general trends. Currently, the data set contains 647 papers on 210 narrowly-defined intervention-outcome combinations, with the greater database containing 15,021 estimates.

Footnote 2: Throughout, I will refer to all 20 as meta-analyses, but some did not have enough comparable outcomes for meta-analysis and became systematic reviews.

I define generalizability and discuss several metrics with which to measure it. Other disciplines have considered generalizability more, so I draw on the literature relating to meta-analysis, which has been most well-developed in medicine, as well as the psychometric literature on generalizability theory (Higgins and Thompson, 2002; Shavelson and Webb, 2006; Briggs and Wilson, 2007). The measures I discuss could also be used in conjunction with any model that seeks to explain variation in treatment effects (e.g. Dehejia, Pop-Eleches and Samii, 2015) to quantify the proportion of variation that such a model explains.

Since some of the analyses will draw upon statistical methods not commonly used in economics, I will use the concrete example of conditional cash transfers (CCTs), which are relatively well-understood and on which many papers have been written, to elucidate the issues.

While this paper focuses on results for impact evaluations of development programs, this is only one of the first areas within economics to which these kinds of methods can be applied. In many of the sciences, knowledge is built through a combination of researchers conducting individual studies and other researchers synthesizing the evidence through meta-analysis. This paper begins that natural next step.
2 Theory

2.1 Heterogeneous Treatment Effects

I model treatment effects as potentially depending on the context of the intervention. Each impact evaluation is on a particular intervention and covers a number of outcomes. The relationship between an outcome, the inputs that were part of the intervention, and the context of the study is complex. In the simplest model, we can imagine that context can be represented by a "contextual variable", C, such that:

Z_j = α + βT_j + δC_j + γT_j C_j + ε_j   (1)

where j indexes the individual, Z represents the value of an aggregate outcome such as "enrollment rates", T indicates being treated, and C represents a contextual variable, such as the type of agency that implemented the program. [Footnote 3]

Footnote 3: Z can equally well be thought of as the average individual outcome for an intervention. Throughout, I take high values for an outcome to represent a beneficial change unless otherwise noted; if an outcome represents a negative characteristic, like incidence of a disease, its sign will be flipped before analysis.

In this framework, a particular impact evaluation might explicitly estimate:

Z_j = α + β′T_j + ε_j   (2)

but, as Equation 1 can be re-written as Z_j = α + (β + γC_j)T_j + δC_j + ε_j, what β′ is really capturing is the effect β′ = β + γC. When C varies, unobserved, in different contexts, the variance of β′ increases.

This is the simplest case. One can imagine that the true state of the world has "interaction effects all the way down". Interaction terms are often considered a second-order problem. However, that intuition could stem from the fact that we usually look for interaction terms within an already fairly homogeneous dataset - e.g. data from a single country, at a single point in time, on a particularly selected sample.

Not all aspects of context need matter to an intervention's outcomes. The set of contextual variables can be divided into a critical set on which outcomes depend and a set on which they do not; I will ignore the latter. Further, the relationship between Z and C can vary by intervention or outcome. For example, school meals programs might have more of an effect on younger children, but scholarship programs could plausibly affect older children more. If one were to regress effect size on the contextual variable "age", we would get different results depending on which intervention and outcome we were considering. Therefore, it will be important in this paper to look only at a restricted set of contextual variables which could plausibly work in a similar way across different interventions. Additional analysis could profitably be done within some interventions, but this is outside the scope of this paper.
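To see how an unobserved contextual variable inflates the variance of estimated effects, consider the following minimal simulation of Equations 1 and 2. It is only a sketch: the parameter values, the normal distribution of C, and the sample sizes are assumptions made for illustration, not features of the data.

```python
# Minimal sketch of Equations 1-2 under assumed, illustrative parameters.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, delta, gamma = 0.0, 0.10, 0.05, 0.08  # hypothetical values

def run_site(C, n=2000):
    """One 'impact evaluation' at a site whose context is summarized by C."""
    T = rng.integers(0, 2, n)          # random assignment to treatment
    eps = rng.normal(0.0, 1.0, n)
    Z = alpha + beta * T + delta * C + gamma * T * C + eps  # Equation 1
    # Estimating Equation 2 within the site recovers beta' = beta + gamma*C:
    return Z[T == 1].mean() - Z[T == 0].mean()

# Sites differ in their (unobserved) context C, so estimated effects
# disperse around beta; the spread reflects gamma*sd(C) plus sampling error.
effects = [run_site(C) for C in rng.normal(0.0, 1.0, 200)]
print(round(np.mean(effects), 3), round(np.std(effects), 3))
```

Even with large within-site samples, the across-site standard deviation does not shrink below |γ|·sd(C), which is exactly the kind of heterogeneity at issue here.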
Generalizability will ultimately depend on the heterogeneity of treatment effects. The next section formally defines generalizability for use in this paper.

2.2 Generalizability: Definitions and Measurement

Definition 1. Generalizability is the ability to predict results accurately out of sample.

Definition 2. Local generalizability is the ability to predict results accurately in a particular out-of-sample group.

There are several ways to operationalize these definitions. The ability to predict results hinges both on the variability of the results and the proportion that can be explained. For example, if the overall variability in a set of results is high, this might not be as concerning if the proportion of variability that can be explained is also high.

It is straightforward to measure the variance in results. However, these statistics need to be benchmarked in order to know what is a "high" or "low" variance. One advantage of the large data set used in this paper is that I can use it to benchmark the results from different intervention-outcome combinations against each other. This is not the first paper to tentatively suggest a scale. Other rules of thumb have also been created in this manner, such as those used to consider the magnitude of effect sizes (0-0.2 SD = "small", 0.2-0.5 SD = "medium", > 0.5 SD = "large") (Cohen, 1988) or the measure of the impact of heterogeneity on meta-analysis results, I² (0.25 = "low", 0.5 = "medium", 0.75 = "high") (Higgins et al., 2003). I can also compare across-paper variation to within-paper variation, with the idea that within-study variation should represent a lower bound to across-study variation within the same intervention-outcome combination. Further, I can create variance benchmarks based on back-of-the-envelope calculations for what the variance would imply for predictive power under a set of assumptions. This will be discussed in more detail later.

One potential drawback to considering the variance of studies' results is that we might be concerned that studies that have higher effect sizes or are measured in terms of units with larger scales have larger variances. This would limit us to making comparisons only between data with the same scale. We could either: 1) restrict attention to those outcomes in the same natural units (e.g. enrollment rates in percentage points); 2) convert results to be in terms of a common unit, such as standard deviations [Footnote 4]; or 3) scale the standard deviation by the mean result, creating the coefficient of variation.

Footnote 4: This can be problematic if the standard deviations themselves vary, but it is a common approach in the meta-analysis literature in lieu of a better option.
The coefficient of variation represents the inverse of the signal-to-noise ratio and, as a unitless figure, can be compared across intervention-outcome combinations with different natural units. It is not immune to criticism, however, particularly in that it may result in large values as the mean approaches zero. [Footnote 5]

Footnote 5: This paper follows convention and reports the absolute value of the coefficient of variation wherever it appears.

All the measures discussed so far focus on variation. However, if we could explain the variation, it would no longer worsen our ability to make predictions in a new setting, so long as we had all the necessary data from that setting, such as covariates, with which to extrapolate. To explain variation, we need a model. The meta-analysis literature suggests two general types of models, which can be parameterized in many ways: fixed-effect models and random-effects models. Fixed-effect models assume there is one true effect of a particular program and all differences between studies can be attributed simply to sampling error. In other words:

Y_i = θ + ε_i   (3)

where Y_i is the observed effect size of a particular study, θ is the true effect, and ε_i is the error term. Random-effects models do not make this assumption; the true effect could potentially vary from context to context. Here,

Y_i = θ_i + ε_i   (4)
    = θ̄ + η_i + ε_i   (5)

where θ_i is the effect size for a particular study i, θ̄ is the mean true effect size, η_i is a particular study's divergence from that mean true effect size, and ε_i is the error. Random-effects models are more plausible, and they are necessary if we think there are heterogeneous treatment effects, so I use them in this paper. Random-effects models can also be modified by the addition of explanatory variables, at which point they are called mixed models; I will also use mixed models in this paper.

Sampling variance, var(Y_i | θ_i), is denoted σ², and between-study variance, var(θ_i), τ².
This variation in observed effect sizes is then:

var(Y_i) = τ² + σ²   (6)

and the proportion of the variation that is not sampling error is:

I² = τ² / (τ² + σ²)   (7)

The I² is an established metric in the meta-analysis literature that helps determine whether a fixed- or random-effects model is more appropriate; the higher the I², the less plausible it is that sampling error drives all the variation in results. I² is considered "low" at 0.25, "medium" at 0.5, and "high" at 0.75 (Higgins et al., 2003). [Footnote 6]

Footnote 6: The Cochrane Collaboration uses a slightly different set of norms, saying 0-0.4 "might not be important", 0.3-0.6 "may represent moderate heterogeneity", 0.5-0.9 "may represent substantial heterogeneity", and 0.75-1 "considerable heterogeneity" (Higgins and Green, 2011).

If we wanted to explain more of the variation, we could do moderator or mediator analysis, in which we examine how results vary with the characteristics of the study, characteristics of its sample, or details about the intervention and its implementation. A linear meta-regression is one way of accomplishing this goal, explicitly estimating:

Y_i = β_0 + Σ_n β_n X_n + η_i + ε_i

where X_n are explanatory variables. This is a mixed model and, upon estimating it, we can calculate several additional statistics: the amount of residual variation in Y_i after accounting for X_n, var_R(Y_i − Ŷ_i); the coefficient of residual variation, CV_R(Y_i − Ŷ_i); and the residual I²_R. Further, we can examine the R² of the meta-regression. It should be noted that a linear meta-regression is only one way of modelling variation in Y_i.

The I², for example, is analogous to the reliability coefficient of classical test theory or the generalizability coefficient of generalizability theory (a branch of psychometrics), both of which estimate the proportion of variation that is not error. In this literature, additional heterogeneity is usually modelled using ANOVA rather than meta-regression. Modelling variation in treatment effects also does not have to occur only retrospectively at the conclusion of studies; we can imagine that a carefully-designed study could anticipate and estimate some of the potential sources of variation experimentally.
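As a concrete illustration of Equations 6 and 7, the sketch below estimates τ² and I² for a cell of study estimates with the DerSimonian-Laird moment estimator. This is a standard frequentist shortcut shown only to fix ideas; the paper's own estimates of σ² and τ² come from the hierarchical Bayesian model of the next subsection, and the inputs below are made-up numbers rather than values from the data.

```python
# DerSimonian-Laird estimates of tau^2 and I^2 from study effect sizes Y
# and standard errors se; inputs are illustrative, not from the data.
import numpy as np

def dersimonian_laird(Y, se):
    Y, v = np.asarray(Y, float), np.asarray(se, float) ** 2
    w = 1.0 / v                                    # fixed-effect weights
    y_fe = np.sum(w * Y) / np.sum(w)               # pooled fixed-effect mean
    Q = np.sum(w * (Y - y_fe) ** 2)                # Cochran's Q statistic
    df = len(Y) - 1
    C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - df) / C)                  # between-study variance
    I2 = max(0.0, (Q - df) / Q) if Q > 0 else 0.0  # Equation 7, estimated
    return tau2, I2

print(dersimonian_laird([0.12, 0.30, 0.05, 0.22], [0.05, 0.06, 0.04, 0.07]))
```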
Table 1 summarizes the different indicators, dividing them into measures of variation and measures of the proportion of variation that is systematic, and distinguishing those that make use of explanatory variables.

Table 1: Summary of heterogeneity measures

                                      Measure of variation                 Measure of proportion of variation that is systematic
Does not use explanatory variables    var(Y_i), CV(Y_i)                    I²
Makes use of explanatory variables    var_R(Y_i − Ŷ_i), CV_R(Y_i − Ŷ_i)    I²_R, R²

Each of these metrics has its advantages and disadvantages. Table 2 summarizes the desirable properties of a measure of heterogeneity and which properties are possessed by each of the discussed indicators.

Table 2: Desirable properties of a measure of heterogeneity

For each measure - var(Y_i), var_R(Y_i − Ŷ_i), CV(Y_i), CV_R(Y_i − Ŷ_i), I², I²_R, and R² - the properties considered are whether it: does not depend on the number of studies in a cell; does not depend on the precision of individual estimates; does not depend on the estimates' units; and does not depend on the mean result in the cell. A "cell" here refers to an intervention-outcome combination. The "precision" of an estimate refers to its standard error.

Measuring heterogeneity using the variance of Y_i requires the Y_i to have comparable units. Using the coefficient of variation requires the assumption that the mean effect size is an appropriate measure with which to scale sd(Y_i). The variance and coefficient of variation also do not have anything to say about the amount of heterogeneity that can be explained. Adding explanatory variables also has its limitations: in any model, we have no way to guarantee that we are indeed capturing all the relevant factors. While I² has the nice property that it disaggregates sampling variance as a source of variation, estimating it depends on the weights applied to each study's results and thus, in turn, on the sample sizes of the studies. The R² has its own well-known caveats, such as that it can be artificially inflated by over-fitting.
Having discussed the different measures of generalizability I will use in this paper, I turn to describe how I will estimate the parameters of the random-effects or mixed models.

2.3 Hierarchical Bayesian Analysis

This paper uses meta-analysis as a tool to synthesize evidence. As a quick review, there are many steps in a meta-analysis, most of which have to do with the selection of the constituent papers. The search and screening of papers will be described in the data section; here, I merely discuss the theory behind how meta-analyses combine results and estimate the parameters σ² and τ² that will be used to generate I². I begin by presenting the random-effects model, followed by the related strategy to estimate a mixed model.

2.4 Estimating a Random Effects Model

To build a hierarchical Bayesian random-effects model, I first assume the data are normally distributed:

Y_ij | θ_i ~ N(θ_i, σ²)   (8)

where j indexes the individuals in the study. I do not have individual-level data, but can instead use sufficient statistics:

Y_i | θ_i ~ N(θ_i, σ_i²)   (9)

where Y_i is the sample mean and σ_i² the sample variance. This provides the likelihood for θ_i. I also need a prior for θ_i. I assume between-study normality:

θ_i ~ N(µ, τ²)   (10)

where µ and τ are unknown hyperparameters. Conditioning on the distribution of the data, given by Equation 9, I get a posterior:

θ_i | µ, τ, Y ~ N(θ̂_i, V_i)   (11)

where

θ̂_i = (Y_i/σ_i² + µ/τ²) / (1/σ_i² + 1/τ²),   V_i = 1 / (1/σ_i² + 1/τ²)   (12)
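In code, Equation 12 is simply a precision-weighted average of the study estimate and the prior mean. The sketch below uses placeholder values for µ and τ; in the actual procedure they are drawn from the posteriors derived next.

```python
# Posterior mean and variance of theta_i given mu and tau (Equations 11-12);
# all inputs here are placeholders for illustration.
def shrink(Y_i, sigma_i, mu, tau):
    precision = 1.0 / sigma_i**2 + 1.0 / tau**2
    theta_hat = (Y_i / sigma_i**2 + mu / tau**2) / precision  # Equation 12
    V_i = 1.0 / precision
    return theta_hat, V_i

# A noisy study (large sigma_i) is pulled strongly toward mu:
print(shrink(Y_i=0.30, sigma_i=0.20, mu=0.10, tau=0.05))
# A precise study stays much closer to its own estimate:
print(shrink(Y_i=0.30, sigma_i=0.02, mu=0.10, tau=0.05))
```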
I then need to pin down µ|τ and τ by constructing their posterior distributions given non-informative priors and updating based on the data. I assume a uniform prior for µ|τ and, as the Y_i are estimates of µ with variance (σ_i² + τ²), obtain:

µ | τ, Y ~ N(µ̂, V_µ)   (13)

where

µ̂ = [Σ_i Y_i/(σ_i² + τ²)] / [Σ_i 1/(σ_i² + τ²)],   V_µ = 1 / [Σ_i 1/(σ_i² + τ²)]   (14)

For τ, note that p(τ|Y) = p(µ, τ|Y) / p(µ|τ, Y). The denominator follows from Equation 13; for the numerator, we can observe that p(µ, τ|Y) is proportional to p(µ, τ)p(Y|µ, τ), and we know the marginal distribution of Y_i|µ, τ:

Y_i | µ, τ ~ N(µ, σ_i² + τ²)   (15)

I use a uniform prior for τ, following Gelman et al. (2005). This yields the posterior for the numerator:

p(µ, τ|Y) ∝ p(µ, τ) Π_i N(Y_i | µ, σ_i² + τ²)   (16)

Putting together all the pieces in reverse order, I first simulate τ from p(τ|Y), then µ given τ and Y, and finally the θ_i.
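The whole simulation scheme fits in a few lines. The following is an illustrative implementation under the stated uniform priors, using a grid approximation to p(τ|Y) in the spirit of Gelman et al.; the data are placeholders rather than estimates from AidGrade's database.

```python
# Grid-based simulation of the random-effects posterior (Equations 12-16):
# draw tau from p(tau|Y), then mu | tau, then each theta_i. Illustrative data.
import numpy as np

rng = np.random.default_rng(1)
Y = np.array([0.12, 0.30, 0.05, 0.22])          # study effect sizes
v = np.array([0.05, 0.06, 0.04, 0.07]) ** 2     # sigma_i^2

def mu_posterior(tau):
    w = 1.0 / (v + tau**2)
    return np.sum(w * Y) / np.sum(w), 1.0 / np.sum(w)  # mu_hat, V_mu (Eq. 14)

taus = np.linspace(1e-4, 1.0, 400)              # grid over tau (uniform prior)
logp = np.empty_like(taus)
for k, tau in enumerate(taus):                  # log p(tau|Y) up to a constant
    mu_hat, V_mu = mu_posterior(tau)
    logp[k] = (0.5 * np.log(V_mu)
               - 0.5 * np.sum(np.log(v + tau**2))
               - 0.5 * np.sum((Y - mu_hat) ** 2 / (v + tau**2)))
p = np.exp(logp - logp.max())
p /= p.sum()

theta_draws = []
for _ in range(2000):
    tau = rng.choice(taus, p=p)                 # 1. simulate tau
    mu_hat, V_mu = mu_posterior(tau)
    mu = rng.normal(mu_hat, np.sqrt(V_mu))      # 2. simulate mu | tau (Eq. 13)
    prec = 1.0 / v + 1.0 / tau**2
    theta_hat = (Y / v + mu / tau**2) / prec    # 3. theta_i | mu, tau (Eq. 12)
    theta_draws.append(rng.normal(theta_hat, np.sqrt(1.0 / prec)))
print(np.mean(theta_draws, axis=0))             # posterior means of theta_i
```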
2.5 Estimating a Mixed Model

The strategy here is similar; Appendix D contains a derivation.

3 Data

This paper uses a database of impact evaluation results collected by AidGrade, a U.S. non-profit research institute that I founded in 2012. AidGrade focuses on gathering the results of impact evaluations and analyzing the data, including through meta-analysis. Its data on impact evaluation results were collected in the course of its meta-analyses from 2012-2014 (AidGrade, 2015).

AidGrade's meta-analyses follow the standard stages: (1) topic selection; (2) a search for relevant papers; (3) screening of papers; (4) data extraction; and (5) data analysis. In addition, it pays attention to (6) dissemination and (7) updating of results. Here, I will discuss the selection of papers (stages 1-3) and the data extraction protocol (stage 4); more detail is provided in Appendix B.

3.1 Selection of Papers

The interventions that were selected for meta-analysis were chosen largely on the basis of there being a sufficient number of studies on that topic. Five AidGrade staff members each independently made a preliminary list of interventions for examination; the lists were then combined and searches done for each topic to determine if there were likely to be enough impact evaluations for a meta-analysis. The remaining list was voted on by the general public online and partially randomized. Appendix B provides further detail.

A comprehensive literature search was done using a mix of the search aggregators SciVerse, Google Scholar, and EBSCO/PubMed. The online databases of J-PAL, IPA, CEGA and 3ie were also searched for completeness. Finally, the references of any existing systematic reviews or meta-analyses were collected.

Any impact evaluation which appeared to be on the intervention in question was included, barring those in developed countries. [Footnote 7] Any paper that tried to consider the counterfactual was considered an impact evaluation. Both published papers and working papers were included. The search and screening criteria were deliberately broad. There is not enough room to include the full text of the search terms and inclusion criteria for all 20 topics in this paper, but these are available in an online appendix as detailed in Appendix A.

Footnote 7: High-income countries, according to the World Bank's classification system.

3.2 Data Extraction

The subset of the data on which I am focusing is based on those papers that passed all screening stages in the meta-analyses. Again, the search and screening criteria were very broad and, after passing the full-text screening, the vast majority of papers that were later excluded were excluded merely because they had no outcome variables in common or did not provide adequate data (for example, not providing data that could be used to calculate the standard error of an estimate, or for a variety of other quirky reasons, such as displaying results only graphically). The small overlap of outcome variables is a surprising and notable feature of the data. Ultimately, the data I draw upon for this paper consist of 15,021 results (double-coded and then reconciled by a third researcher) across 647 papers covering the 20 types of development program listed in Table 3. [Footnote 8] For the sake of comparison, though the two organizations clearly do different things, at the present time of writing this is more impact evaluations than J-PAL has published, concentrated in these 20 topics. Unfortunately, only 318 of these papers both overlapped in outcomes with another paper and were able to be standardized and thus included in the main results, which rely on intervention-outcome groups. Outcomes were defined under several rules of varying specificity, as will be discussed shortly.

Footnote 8: Three titles here may be misleading. "Mobile phone-based reminders" refers specifically to SMS or voice reminders for health-related outcomes. "Women's empowerment programs" required an educational component to be included in the intervention and it could not be an unrelated intervention that merely disaggregated outcomes by gender. Finally, micronutrients were initially too loosely defined; this was narrowed down to focus on those providing zinc to children, but the other micronutrient papers are still included in the data, with a tag, as they may still be useful.
Table 3: List of Development Programs Covered

2012                              2013
Conditional cash transfers        Contract teachers
Deworming                         Financial literacy training
Improved stoves                   HIV education
Insecticide-treated bed nets      Irrigation
Microfinance                      Micro health insurance
Safe water storage                Micronutrient supplementation
Scholarships                      Mobile phone-based reminders
School meals                      Performance pay
Unconditional cash transfers      Rural electrification
Water treatment                   Women's empowerment programs

73 variables were coded for each paper. Additional topic-specific variables were coded for some sets of papers, such as the median and mean loan size for microfinance programs. This paper focuses on the variables held in common across the different topics. These include: which method was used; if randomized, whether it was randomized by cluster; whether it was blinded; where it was (village, province, country - these were later geocoded in a separate process); what kind of institution carried out the implementation; characteristics of the population; and the duration of the intervention from the baseline to the midline or endline results, among others. A full set of variables and the coding manual is available online, as detailed in Appendix A.

As this paper pays particular attention to the program implementer, it is worth discussing how this variable was coded in more detail. There were several types of implementers that could be coded: governments, NGOs, private sector firms, and academics. There was also a code for "other" (primarily collaborations) or "unclear". The vast majority of studies were implemented by academic research teams and NGOs. This paper considers NGOs and academic research teams together because it turned out to be practically difficult to distinguish between them in the studies, especially as the passive voice was frequently used (e.g. "X was done" without noting who did it). There were only a few private sector firms involved, so they are considered with the "other" category in this paper; a sketch of this recoding appears below.
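As an illustration, the collapsing of implementer codes described above amounts to a simple mapping; the raw code labels here are hypothetical stand-ins, not AidGrade's actual codes.

```python
# Hypothetical recoding of raw implementer codes into analysis categories.
IMPLEMENTER_RECODE = {
    "government": "government",
    "NGO": "NGO/academic",        # NGOs and academic teams pooled: hard to
    "academic": "NGO/academic",   # tell apart when papers use passive voice
    "private firm": "other",      # too few firms to analyze separately
    "other": "other",
    "unclear": "unclear",
}

print([IMPLEMENTER_RECODE[c] for c in ("academic", "NGO", "private firm")])
```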
Studies tend to report results for multiple specifications. AidGrade focused on those results least likely to have been influenced by author choices: those with the fewest controls, apart from fixed effects. Where a study reported results using different methodologies, coders were instructed to collect the findings obtained under the authors' preferred methodology; where the preferred methodology was unclear, coders were advised to follow the internal preference ordering of prioritizing randomized controlled trials, followed by regression discontinuity designs and differences-in-differences, followed by matching, and to collect multiple sets of results when they were unclear on which to include. Where results were presented separately for multiple subgroups, coders were similarly advised to err on the side of caution and to collect both the aggregate results and results by subgroup, except where the author appeared to be only including a subgroup because results were significant within that subgroup. For example, if an author reported results for children aged 8-15 and then also presented results for children aged 12-13, only the aggregate results would be recorded; but if the author presented results for children aged 8-9, 10-11, 12-13, and 14-15, all subgroups would be coded as well as the aggregate result when presented. Authors only rarely reported isolated subgroups, so this was not a major issue in practice.

When considering the variation of effect sizes within a group of papers, the definition of the group is clearly critical. Two different rules were initially used to define outcomes: a strict rule, under which only identical outcome variables are considered alike, and a loose rule, under which similar but distinct outcomes are grouped into clusters. The precise coding rules were as follows:

1. We consider outcome A to be the same as outcome B under the "strict rule" if outcomes A and B measure the exact same quality. Different units may be used, pending conversion. The outcomes may cover different timespans (e.g. encompassing both outcomes over "the last month" and "the last week"). They may also cover different populations (e.g. children or adults). Examples: height; attendance rates.

2. We consider outcome A to be the same as outcome B under the "loose rule" if they do not meet the strict rule but are clearly related. Example: parasitemia greater than 4000/µl with fever and parasitemia greater than 2500/µl.

Clearly, even under the strict rule, differences between the studies may exist; however, using two different rules allows us to isolate the potential sources of variation, and other variables were coded to capture some of this variation, such as the age of those in the sample. If one were to divide the studies by these characteristics, however, the data would usually be too sparse for analysis.

Interventions were also defined separately, and coders were asked to write a short description of the details of each program. Program names were recorded so as to identify those papers on the same program, such as the various evaluations of PROGRESA.
After coding, the data were then standardized to make results easier to interpret and so as not to overly weight those outcomes with larger scales. The typical way to compare results across different outcomes is by using the standardized mean difference, defined as:

SMD = (µ_1 − µ_2) / σ_p

where µ_1 is the mean outcome in the treatment group, µ_2 is the mean outcome in the control group, and σ_p is the pooled standard deviation. When data are not available to calculate the pooled standard deviation, it can be approximated by the standard deviation of the dependent variable for the entire distribution of observations or as the standard deviation in the control group (Glass, 1976). If that is not available either, due to standard deviations not having been reported in the original papers, one can use the typical standard deviation for the intervention-outcome. I follow this approach to calculate the standardized mean difference, which is then used as the effect size measure for the rest of the paper unless otherwise noted.

This paper uses the "strict" outcomes where available, but the "loose" outcomes where that would keep more data. For papers which were follow-ups of the same study, the most recent results were used for each outcome.

Finally, one paper appeared to misreport results, suggesting implausibly low values and standard deviations for hemoglobin. These results were excluded and the paper's corresponding author contacted. Excluding this paper's results, effect sizes range between -1.5 and 1.8 SD, with an interquartile range of 0 to 0.2 SD. So as to mitigate sensitivity to individual results, especially with the small number of papers in some intervention-outcome groups, I restrict attention to those standardized effect sizes less than 2 SD away from 0, dropping 1 additional observation. I report main results including this observation in the Appendix.
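The standardization step can be sketched as below. The fallback ordering follows the description above (pooled SD, then full-sample or control-group SD, then the typical SD for the intervention-outcome); the function and argument names are illustrative, not AidGrade's actual code.

```python
# Standardized mean difference with the fallback SDs described above.
import math

def pooled_sd(sd_t, n_t, sd_c, n_c):
    """Pooled standard deviation from the two groups' SDs and sizes."""
    return math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                     / (n_t + n_c - 2))

def smd(mu_t, mu_c, sd_pool=None, sd_fallback=None, sd_typical=None):
    for sd in (sd_pool, sd_fallback, sd_typical):  # first available SD wins
        if sd:
            return (mu_t - mu_c) / sd
    raise ValueError("no usable standard deviation reported")

print(smd(0.72, 0.65, sd_pool=pooled_sd(0.20, 500, 0.22, 480)))
```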
3.3 Data Description

Figure 2 summarizes the distribution of studies covering the interventions and outcomes considered in this paper that can be standardized. Attention will typically be limited to those intervention-outcome combinations on which we have data for at least three papers. Table 13 in Appendix C lists the interventions and outcomes and describes their results in more detail, providing the distribution of significant and insignificant results. It should be emphasized that the numbers of negative and significant, insignificant, and positive and significant results per intervention-outcome combination provide only ambiguous evidence of the typical efficacy of a particular type of intervention. Simply tallying the numbers in each category is known as "vote counting" and can yield misleading results if, for example, some studies are underpowered.

Table 4 further summarizes the distribution of papers across interventions and highlights the fact that papers exhibit very little overlap in terms of outcomes studied. This is consistent with researchers each wanting to publish one of the first papers on a topic; Vivalt (2015a) finds that later papers on the same intervention-outcome combination more often remain working papers.

A note must be made about combining data. When conducting a meta-analysis, the Cochrane Handbook for Systematic Reviews of Interventions recommends collapsing the data to one observation per intervention-outcome-paper, and I do this when generating the within intervention-outcome meta-analyses (Higgins and Green, 2011). Where results had been reported for multiple subgroups (e.g. women and men), I aggregated them as in the Cochrane Handbook's Table 7.7.a. Where results were reported for multiple time periods (e.g. 6 months and 12 months after the intervention), I used the most comparable time periods across papers. When combining across multiple outcomes, which has limited use but will come up later in the paper, I used the formulae from Borenstein et al. (2009), Chapter 24.
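For concreteness, a sketch of the Handbook's formulae for collapsing two subgroups into a single group-level estimate (the combined N, mean, and SD of Table 7.7.a):

```python
import math

def combine_two_groups(n1, m1, sd1, n2, m2, sd2):
    """Combine two subgroup estimates into one group-level estimate,
    following the formulae in the Cochrane Handbook's Table 7.7.a.
    Returns (n, mean, sd). Sketch for illustration."""
    n = n1 + n2
    mean = (n1 * m1 + n2 * m2) / n
    # Combined variance includes a between-subgroup component.
    var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2
           + (n1 * n2 / n) * (m1 - m2)**2) / (n - 1)
    return n, mean, math.sqrt(var)
```

With more than two subgroups, the same formula can be applied iteratively, folding in one subgroup at a time.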
Table 4: Descriptive Statistics: Distribution of Narrow Outcomes

Intervention                    Number of   Mean papers   Max papers
                                outcomes    per outcome   per outcome
Conditional cash transfers          10          21            37
Contract teachers                    1           3             3
Deworming                           12          13            18
Financial literacy                   1           5             5
HIV/AIDS Education                   3           8            10
Improved stoves                      4           2             2
Insecticide-treated bed nets         1           9             9
Irrigation                           2           2             2
Micro health insurance               1           2             2
Microfinance                         5           4             5
Micronutrient supplementation       23          27            47
Mobile phone-based reminders         2           4             5
Performance pay                      1           3             3
Rural electrification                3           3             3
Safe water storage                   1           2             2
Scholarships                         3           4             5
School meals                         3           3             3
Unconditional cash transfers         3           9            11
Water treatment                      2           5             6
Women's empowerment programs         2           2             2
Average                            4.2         6.5           9.0
4 Generalizability of Impact Evaluation Results

4.1 Without Modelling Heterogeneity

Table 5 presents results for the metrics described earlier, within intervention-outcome combinations. All Yi were converted to be in terms of standard deviations to put them on a common scale before statistics were calculated, with the aforementioned caveats. The different measures yield quite different results because they measure different things, as previously discussed: the coefficient of variation depends heavily on the mean; the I², on the precision of the underlying estimates.

Table 5: Heterogeneity Measures for Effect Sizes Within Intervention-Outcomes

Intervention                  Outcome                          var(Yi)   CV(Yi)     I²
Microfinance                  Assets                            0.000     5.508   1.000
Rural Electrification         Enrollment rate                   0.001     0.129   0.768
Micronutrients                Cough prevalence                  0.001     1.648   0.995
Microfinance                  Total income                      0.001     0.989   0.999
Microfinance                  Savings                           0.002     1.773   1.000
Financial Literacy            Savings                           0.004     5.472   0.891
Microfinance                  Profits                           0.005     5.448   1.000
Contract Teachers             Test scores                       0.005     0.403   1.000
Performance Pay               Test scores                       0.006     0.608   1.000
Micronutrients                Body mass index                   0.007     0.675   1.000
Conditional Cash Transfers    Unpaid labor                      0.009     0.920   0.797
Micronutrients                Weight-for-age                    0.009     1.941   0.884
Micronutrients                Weight-for-height                 0.010     2.148   0.677
Micronutrients                Birthweight                       0.010     0.981   0.827
Micronutrients                Height-for-age                    0.012     2.467   0.942
Conditional Cash Transfers    Test scores                       0.013     1.866   0.995
Deworming                     Hemoglobin                        0.015     3.377   0.919
Micronutrients                Mid-upper arm circumference       0.015     2.078   0.502
Conditional Cash Transfers    Enrollment rate                   0.015     0.831   1.000
Unconditional Cash Transfers  Enrollment rate                   0.016     1.093   0.998
Water Treatment               Diarrhea prevalence               0.020     0.966   1.000
SMS Reminders                 Treatment adherence               0.022     1.672   0.780
Conditional Cash Transfers    Labor force participation         0.023     1.628   0.424
School Meals                  Test scores                       0.023     1.288   0.559
Micronutrients                Height                            0.023     4.369   0.826
Micronutrients                Mortality rate                    0.025     2.880   0.201
Micronutrients                Stunted                           0.025     1.110   0.262
Bed Nets                      Malaria                           0.029     0.497   0.880
Conditional Cash Transfers    Attendance rate                   0.030     0.523   0.939
Micronutrients                Weight                            0.034     2.696   0.549
HIV/AIDS Education            Used contraceptives               0.036     3.117   0.490
Micronutrients                Perinatal deaths                  0.038     2.096   0.176
Deworming                     Height                            0.049     2.361   1.000
Micronutrients                Test scores                       0.052     1.694   0.966
Scholarships                  Enrollment rate                   0.053     0.687   1.000
Conditional Cash Transfers    Height-for-age                    0.055    22.166   0.165
Deworming                     Weight-for-height                 0.072     3.129   0.986
Micronutrients                Stillbirths                       0.075     3.041   0.108
School Meals                  Enrollment rate                   0.081     1.142   0.080
Micronutrients                Prevalence of anemia              0.095     0.793   0.692
Deworming                     Height-for-age                    0.098     1.978   1.000
Deworming                     Weight-for-age                    0.107     2.287   0.998
Micronutrients                Diarrhea incidence                0.109     3.300   0.985
Micronutrients                Diarrhea prevalence               0.111     1.205   0.837
Micronutrients                Fever prevalence                  0.146     3.076   0.667
Deworming                     Weight                            0.184     4.758   1.000
Micronutrients                Hemoglobin                        0.215     1.439   0.984
SMS Reminders                 Appointment attendance rate       0.224     2.908   0.869
Deworming                     Mid-upper arm circumference       0.439     1.773   0.994
Conditional Cash Transfers    Probability unpaid work           0.609     6.415   0.834
Rural Electrification         Study time                        0.997     1.102   0.142

How should we interpret these numbers? Higgins and Thompson, who defined I², called 0.25 indicative of "low", 0.5 of "medium", and 0.75 of "high" levels of heterogeneity (2002; Higgins et al., 2003). Figure 3 plots a histogram of the results, with lines demarcating these values. Clearly, by the I² measure, much of the variation in results reflects genuine heterogeneity rather than sampling error. No similarly defined benchmarks exist for the variance or coefficient of variation, although studies in the medical literature tend to exhibit a coefficient of variation of approximately 0.05-0.5 (Tian, 2005; Ng, 2014). By this standard, too, results would appear quite heterogeneous.
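For reference, I² can be computed from Cochran's Q statistic; a minimal sketch:

```python
def i_squared(effects, variances):
    """Higgins-Thompson I^2 from study effects and sampling variances.

    Computes Cochran's Q under a fixed-effect model and returns
    I^2 = max(0, (Q - df) / Q), the share of total variation not
    attributable to sampling error. Sketch for illustration."""
    w = [1 / v for v in variances]
    ybar = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    return max(0.0, (q - df) / q) if q > 0 else 0.0
```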
Figure 3: Density of I² values

We can also compare values across the different intervention-outcome combinations within the data set. Here, the intervention-outcome combinations that fall within the bottom third by variance have var(Yi) ≤ 0.015; the top third have var(Yi) ≥ 0.052. Similarly, the threshold delineating the bottom third for the coefficient of variation is 1.14 and, for the top third, 2.36; for I², the thresholds are 0.78 and 0.99, respectively. If we expect these intervention-outcomes to be broadly comparable to others we might want to consider in the future, we could use these values to benchmark future results.

Defining dispersion as "low" or "high" in this manner may be unsatisfying because the resulting classifications are relative. Relative classifications have some value but are sometimes not very informative; for example, it is hard to argue that there is a meaningful difference between an I² just below 0.99 and an I² just above 0.99. An alternative benchmark that might have more appeal is the average within-study variance or coefficient of variation: if the across-study variation approached the within-study variation, we might not be so concerned about generalizability.

Table 6 illustrates the gap between the across-study and mean within-study variance, coefficient of variation, and I², for those intervention-outcomes for which we have enough data to calculate the within-study measures. Not all studies report multiple results for the intervention-outcome combination in question; a paper might do so if, for example, it reported results for different subgroups, such as different age groups, genders, or geographic areas. The median within-paper variance, for those papers for which it can be generated, is 0.027, while it is 0.037 across papers; similarly, the median within-paper coefficient of variation is 0.91, compared to 1.48 across papers. If we were to form the I² for each paper separately, the median within-paper value would be 0.63, as opposed to 0.94 across papers. Figure 4 presents the distributions graphically; to increase the sample size, this figure includes results even when there are only two papers within an intervention-outcome combination or two results reported within a paper.
Table 6: Across-Paper vs. Mean Within-Paper Heterogeneity

Columns report each measure across papers and, after the slash, the mean within-paper value.

Intervention                  Outcome                        var(Yi)          CV(Yi)           I²
                                                             across/within    across/within    across/within
Micronutrients                Cough prevalence               0.001 / 0.006    1.017 / 3.181    0.755 / 1.000
Conditional Cash Transfers    Enrollment rate                0.009 / 0.027    0.790 / 0.968    0.998 / 0.682
Conditional Cash Transfers    Unpaid labor                   0.009 / 0.004    0.918 / 0.853    0.781 / 0.778
Deworming                     Hemoglobin                     0.009 / 0.068    1.639 / 8.687    0.583 / 0.712
Micronutrients                Weight-for-height              0.010 / 0.005    2.252 / *        0.665 / 0.633
Micronutrients                Birthweight                    0.010 / 0.011    0.974 / 0.963    0.784 / 0.882
Micronutrients                Weight-for-age                 0.010 / 0.124    2.370 / 0.713    1.000 / 0.652
School Meals                  Height-for-age                 0.011 / 0.000    1.086 / *        0.942 / 0.703
Micronutrients                Height-for-age                 0.012 / 0.042    2.474 / 3.751    0.993 / 0.508
Unconditional Cash Transfers  Enrollment rate                0.014 / 0.014    1.223 / *        0.982 / 0.497
SMS Reminders                 Treatment adherence            0.022 / 0.008    1.479 / 0.672    0.958 / 0.573
Micronutrients                Height                         0.023 / 0.028    4.001 / 3.471    0.896 / 0.548
Micronutrients                Stunted                        0.024 / 0.059    1.085 / 24.373   0.348 / 0.149
Micronutrients                Mortality rate                 0.026 / 0.195    2.533 / 1.561    0.164 / 0.077
Micronutrients                Weight                         0.029 / 0.027    2.852 / 0.149    0.629 / 0.228
Micronutrients                Fever prevalence               0.034 / 0.011    5.937 / 0.126    0.602 / 0.066
Microfinance                  Total income                   0.037 / 0.003    1.770 / 1.232    0.970 / 1.000
Conditional Cash Transfers    Probability unpaid work        0.046 / 0.386    1.419 / 0.408    0.989 / 0.517
Conditional Cash Transfers    Attendance rate                0.046 / 0.018    0.591 / 0.526    0.988 / 0.313
Deworming                     Height                         0.048 / 0.112    1.845 / 0.211    1.000 / 0.665
Micronutrients                Perinatal deaths               0.049 / 0.015    2.087 / 0.234    0.451 / 0.089
Bed Nets                      Malaria                        0.052 / 0.047    0.650 / 4.093    0.967 / 0.551
Scholarships                  Enrollment rate                0.053 / 0.026    1.094 / 1.561    1.000 / 0.612
Conditional Cash Transfers    Height-for-age                 0.055 / 0.002    22.166 / 1.212   0.162 / 0.600
HIV/AIDS Education            Used contraceptives            0.059 / 0.120    2.863 / 6.967    0.424 / 0.492
Deworming                     Weight-for-height              0.072 / 0.164    3.127 / *        1.000 / 0.907
Deworming                     Height-for-age                 0.100 / 0.005    2.043 / 1.842    1.000 / 0.741
Deworming                     Weight-for-age                 0.108 / 0.004    2.317 / 1.040    1.000 / 0.704
Micronutrients                Diarrhea incidence             0.135 / 0.016    2.844 / 1.741    0.922 / 0.807
Micronutrients                Diarrhea prevalence            0.137 / 0.029    1.375 / 3.385    0.811 / 0.664
Deworming                     Weight                         0.168 / 0.121    4.087 / 1.900    0.995 / 0.813
Conditional Cash Transfers    Labor force participation      0.790 / 0.047    2.931 / 4.300    0.378 / 0.559
Micronutrients                Hemoglobin                     2.650 / 0.176    2.982 / 0.731    1.000 / 0.996

Within-paper values are based on those papers which report results for different subsets of the data. For closer comparison of the across- and within-paper statistics, the across-paper values are based on the same data set, aggregating the within-paper results to one observation per intervention-outcome-paper, as discussed. A paper had to report three results for an intervention-outcome combination to be included in the calculation, in addition to the requirement that there be three papers on the intervention-outcome combination. Due to the slightly different sample, the across-paper statistics diverge slightly from those reported in Table 5. Occasionally, within-paper measures of the mean equal or approach zero, making the coefficient of variation undefined or unreasonable; "*" denotes those coefficients of variation that were either undefined or greater than 10,000,000.
Figure 4: Distribution of within- and across-paper heterogeneity measures

We can also gauge the magnitudes of these measures by comparison with effect sizes. Effect sizes are typically considered "small" if they are less than 0.2 SD, and the largest coefficient of variation typically considered in the medical literature is 0.5 (Tian, 2005; Ng, 2014). Taking 0.5 as a very conservative upper bound for a "small" coefficient of variation would imply a variance of less than 0.01 for an effect size of 0.2. That the actual mean effect size in the data is closer to 0.1 makes this even more of an upper bound; applying the same reasoning to an effect size of 0.1 would set the threshold at a variance of 0.0025.

Finally, we can try to set bounds more directly, based on the expected prediction error. Here it is immediately apparent that what counts as large or small error depends on the policy question. In some cases, it might not matter if an effect size were mis-predicted by 25%; in others, a prediction error of this magnitude could mean the difference between choosing one program over another, or determine whether a program is worthwhile to pursue at all. Still, if we take the mean effect size within an intervention-outcome to be our "best guess" of how a program will perform and, as an illustrative example, want the prediction error to be less than 25% at least 50% of the time, this implies a cut-off threshold for the variance if we assume that results are normally distributed. Note that the assumption that results are drawn from the same normal distribution, whose mean and variance can be approximated by the mean and variance of observed results, is a simplification for the purpose of a back-of-the-envelope calculation; we would expect results to be drawn from different distributions. Table 7 summarizes the implied bounds on the variance for the prediction error to be less than 25% and 50%, respectively, alongside the actual variance in results within each intervention-outcome. In only 1 of 51 cases is the true variance in results smaller than the variance implied by the 25% prediction error cut-off, and in only 9 other cases is it below the 50% threshold. In other words, the variance of results within each intervention-outcome would imply a prediction error of more than 50% more than 80% of the time.
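The logic can be checked by simulation; the sketch below computes the share of draws within a given percentage of the mean under the normality assumption (the paper's thresholds invert this relationship analytically, so exact values may differ):

```python
import random

def share_within(pct, mean, var, draws=100_000, seed=0):
    """Monte Carlo share of results within `pct` of the mean, assuming
    results are drawn from a normal distribution centered at the mean.
    Back-of-the-envelope sketch for the Table 7 exercise."""
    rng = random.Random(seed)
    sd = var ** 0.5
    hits = sum(abs(rng.gauss(mean, sd) - mean) <= pct * abs(mean)
               for _ in range(draws))
    return hits / draws

# Example: CCTs on enrollment (mean 0.150, variance 0.015) gives
# share_within(0.25, 0.150, 0.015) of roughly 0.24, i.e. the prediction
# error exceeds 25% about three-quarters of the time.
```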
Table 7: Actual Variance vs. Variance for Prediction Error Thresholds

Intervention                  Outcome                        Ȳi       var(Yi)   var25    var50
Microfinance                  Assets                          0.003    0.000    0.000    0.000
Rural Electrification         Enrollment rate                 0.176    0.001    0.005    0.027
Micronutrients                Cough prevalence               -0.016    0.001    0.000    0.000
Microfinance                  Total income                    0.029    0.001    0.000    0.001
Microfinance                  Savings                         0.027    0.002    0.000    0.001
Financial Literacy            Savings                        -0.012    0.004    0.000    0.000
Microfinance                  Profits                        -0.013    0.005    0.000    0.000
Contract Teachers             Test scores                     0.182    0.005    0.005    0.029
Performance Pay               Test scores                     0.131    0.006    0.003    0.015
Micronutrients                Body mass index                 0.125    0.007    0.002    0.014
Conditional Cash Transfers    Unpaid labor                    0.103    0.009    0.002    0.009
Micronutrients                Weight-for-age                  0.050    0.009    0.000    0.002
Micronutrients                Weight-for-height               0.045    0.010    0.000    0.002
Micronutrients                Birthweight                     0.102    0.010    0.002    0.009
Micronutrients                Height-for-age                  0.044    0.012    0.000    0.002
Conditional Cash Transfers    Test scores                     0.062    0.013    0.001    0.003
Deworming                     Hemoglobin                      0.036    0.015    0.000    0.001
Micronutrients                Mid-upper arm circumference     0.058    0.015    0.001    0.003
Conditional Cash Transfers    Enrollment rate                 0.150    0.015    0.003    0.019
Unconditional Cash Transfers  Enrollment rate                 0.115    0.016    0.002    0.011
Water Treatment               Diarrhea prevalence             0.145    0.020    0.003    0.018
SMS Reminders                 Treatment adherence             0.088    0.022    0.001    0.007
Conditional Cash Transfers    Labor force participation       0.092    0.023    0.001    0.007
School Meals                  Test scores                     0.117    0.023    0.002    0.012
Micronutrients                Height                          0.035    0.023    0.000    0.001
Micronutrients                Mortality rate                 -0.054    0.025    0.000    0.003
Micronutrients                Stunted                         0.143    0.025    0.003    0.018
Bed Nets                      Malaria                         0.342    0.029    0.018    0.101
Conditional Cash Transfers    Attendance rate                 0.333    0.030    0.017    0.096
Micronutrients                Weight                          0.068    0.034    0.001    0.004
HIV/AIDS Education            Used contraceptives             0.061    0.036    0.001    0.003
Micronutrients                Perinatal deaths               -0.093    0.038    0.001    0.008
Deworming                     Height                          0.094    0.049    0.001    0.008
Micronutrients                Test scores                     0.134    0.052    0.003    0.016
Scholarships                  Enrollment rate                 0.336    0.053    0.017    0.098
Conditional Cash Transfers    Height-for-age                 -0.011    0.055    0.000    0.000
Deworming                     Weight-for-height               0.086    0.072    0.001    0.006
Micronutrients                Stillbirths                    -0.090    0.075    0.001    0.007
School Meals                  Enrollment rate                 0.250    0.081    0.009    0.054
Micronutrients                Prevalence of anemia            0.389    0.095    0.023    0.131
Deworming                     Height-for-age                  0.159    0.098    0.004    0.022
Deworming                     Weight-for-age                  0.143    0.107    0.003    0.018
Micronutrients                Diarrhea incidence              0.100    0.109    0.002    0.009
Micronutrients                Diarrhea prevalence             0.277    0.111    0.012    0.066
Micronutrients                Fever prevalence                0.124    0.146    0.002    0.013
Deworming                     Weight                          0.090    0.184    0.001    0.007
Micronutrients                Hemoglobin                      0.322    0.215    0.016    0.090
SMS Reminders                 Appointment attendance rate     0.163    0.224    0.004    0.023
Deworming                     Mid-upper arm circumference     0.373    0.439    0.021    0.121
Conditional Cash Transfers    Probability unpaid work        -0.122    0.609    0.002    0.013
Rural Electrification         Study time                      0.906    0.997    0.125    0.710

var25 represents the variance that would result in a 25% prediction error for draws from a normal distribution centered at Ȳi; var50 represents the variance that would result in a 50% prediction error.

4.2 With Modelling Heterogeneity

4.2.1 Across Intervention-Outcomes

The results so far have not considered how much of the heterogeneity can be explained. If the heterogeneity can be systematically modelled, our ability to make predictions improves. Do results exhibit any variation that is systematic? To begin, I present some OLS results looking across different intervention-outcome combinations, to examine whether effect sizes are associated with any characteristics of the program, study, or sample, pooling data from different intervention-outcomes.

As Table 8 indicates, there is some evidence that studies with a smaller number of observations report greater effect sizes than studies based on a larger number of observations. This is what we would expect if specification searching were easier in small data sets; this pattern would also arise if power calculations drove researchers to proceed with small-sample studies only when they believed the program would have a large effect size, or if larger studies are less well-targeted. Interestingly, government-implemented programs fare worse even controlling for sample size (the omitted dummy variable category is "Other-implemented", which mainly consists of collaborations and private sector-implemented interventions). Studies in the Middle East / North Africa region may appear to do slightly better than those in Sub-Saharan Africa (the excluded region category), but not much weight should be put on this, as very few studies were conducted in the former region.

While these regressions have the advantage of drawing on a larger sample of studies, and we might think that any patterns observed across so many interventions and outcomes are fairly robust, we might be able to explain more variation by restricting attention to a particular intervention-outcome combination. I therefore focus on the case of conditional cash transfers (CCTs) and enrollment rates, as this is the intervention-outcome combination that contains the largest number of papers.
Table 8: Regression of Effect Size on Study Characteristics

Dependent variable: effect size. Coefficients with standard errors in parentheses, across five specifications, (1)-(5); each variable's coefficients are listed in the order of the specifications in which it appears.

Number of observations (100,000s)   -0.011** (0.00)    -0.012*** (0.00)   -0.009* (0.00)
Government-implemented              -0.107*** (0.04)   -0.087** (0.04)
Academic/NGO-implemented            -0.055 (0.04)      -0.057 (0.05)
RCT                                  0.038 (0.03)
East Asia                           -0.003 (0.03)
Latin America                        0.012 (0.04)
Middle East/North Africa             0.275** (0.11)
South Asia                           0.021 (0.04)
Constant                             0.120*** (0.00)    0.180*** (0.03)    0.091*** (0.02)   0.105*** (0.02)   0.177*** (0.03)
Observations                         556   656   656   556   556
R²                                   0.20  0.23  0.22  0.23  0.20
4.2.2 Within an Intervention-Outcome Combination: The Case of CCTs and Enrollment Rates

The previous results used the across-intervention-outcome data, which were aggregated to one result per intervention-outcome-paper. However, more variation might be explained by carefully modelling results within a particular intervention-outcome combination. This section provides an example, using the case of conditional cash transfers and enrollment rates, the intervention-outcome combination covered by the most papers.

Suppose we were to try to explain as much variability in outcomes as possible using sample characteristics. The available variables which might plausibly bear a relationship to effect size are: the baseline enrollment rate;⁹ the sample size; whether the study was done in a rural or urban setting, or both; results for other programs in the same region;¹⁰ and the age and gender of the sample under consideration. Table 9 shows the results of OLS regressions of the effect size on these variables, in turn. The baseline enrollment rate shows the strongest relationship to effect size, as reflected in the R² and significance levels: it is easier to achieve large gains where initial rates are low. Some papers pay particular attention to children who were not enrolled at baseline, or to children who were enrolled at baseline; these are coded as a "0%" or "100%" enrollment rate at baseline but are also represented by two dummy variables. Larger studies and studies done in urban areas also tend to find smaller effect sizes than smaller studies or studies done in rural or mixed urban/rural areas. Finally, for each result I calculate the mean result in the same region, excluding results from the program in question. Results do appear slightly correlated across different programs in the same region.

⁹ In some cases, only endline enrollment rates are reported. This variable is therefore constructed by using baseline rates for both the treatment and control group where they are available, followed by, in turn, the baseline rate for the control group; the baseline rate for the treatment group; the endline rate for the control group; the endline rate for the treatment and control group; and the endline rate for the treatment group.

¹⁰ Regions include: Latin America, Africa, the Middle East and North Africa, East Asia, and South Asia, following the World Bank's geographical divisions.
Table 9: Regression of Projects' Effect Sizes on Characteristics (CCTs on Enrollment Rates)

Dependent variable: effect size. Coefficients with standard errors in parentheses, across ten specifications, (1)-(10); each variable's coefficients are listed in the order of the specifications in which it appears.

Enrollment Rates                    -0.224*** (0.05)   -0.092 (0.06)   -0.127*** (0.02)
Enrolled at Baseline                -0.002 (0.02)
Not Enrolled at Baseline             0.183*** (0.05)    0.142*** (0.03)
Number of Observations (100,000s)   -0.011* (0.01)     -0.002 (0.00)
Rural                                0.049** (0.02)     0.002 (0.02)
Urban                               -0.068*** (0.02)   -0.039** (0.02)
Girls                               -0.002 (0.03)
Boys                                -0.019 (0.02)
Minimum Sample Age                   0.005 (0.01)
Mean Regional Result                 1.000** (0.38)     0.714** (0.28)
Observations                         112  112  108  130  130  130  130  104  130  92
R²                                   0.41 0.52 0.01 0.06 0.05 0.00 0.01 0.02 0.01 0.58
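Table 10 below turns to random-effects and mixed models. As a sketch of how such a model can be estimated — a method-of-moments τ² followed by weighted least squares, which is one common estimator and not necessarily the paper's exact procedure:

```python
import numpy as np

def re_meta_regression(y, v, X):
    """Random-effects meta-regression: method-of-moments tau^2
    (DerSimonian-Laird style) followed by weighted least squares with
    weights 1/(v_i + tau^2). Illustrative sketch only.

    y: effect sizes; v: sampling variances; X: moderator matrix
    (include a column of ones for the intercept)."""
    y, v, X = map(np.asarray, (y, v, X))
    w = 1.0 / v
    # Fixed-effect weighted fit, used to form the residual Q statistic.
    beta_fe = np.linalg.lstsq(X * np.sqrt(w)[:, None],
                              y * np.sqrt(w), rcond=None)[0]
    q = np.sum(w * (y - X @ beta_fe) ** 2)
    df = len(y) - X.shape[1]
    # Simple DL-type plug-in for tau^2 (the moderator-adjusted version
    # uses a hat-matrix trace; this is the intercept-only approximation).
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)
    w_re = 1.0 / (v + tau2)
    beta = np.linalg.lstsq(X * np.sqrt(w_re)[:, None],
                           y * np.sqrt(w_re), rcond=None)[0]
    return beta, tau2
```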
Table 10: Impact of Mixed Models on Measures

Model                   var(Yi)   varR(Yi − Ŷi)   CV(Yi)   CVR(Yi − Ŷi)   I²     I²R    N
Random effects model     0.011        0.011        1.24        1.24       0.97   0.97   122
Mixed model (1)          0.011        0.007        1.28        1.04       0.97   0.96   104
Mixed model (2)          0.012        0.005        1.25        0.85       0.96   0.93    87

As baseline enrollment rates have the strongest relationship to effect size, I use them as an explanatory variable in a hierarchical mixed model, to explore how this affects the residual varR(Yi − Ŷi), CVR(Yi − Ŷi) and I²R. I also use the specification in column (10) of Table 9 as a robustness check. The results are reported in Table 10 for each of these two mixed models, alongside the values from the random effects model, which does not use any explanatory variables. Not all papers provide information for each explanatory variable, and each row is based only on those studies which could be used to estimate the model; thus the values of var(Yi), CV(Yi) and I², which do not depend on the model used, may still vary between rows.

In the random effects model, since no explanatory variables are used, Ŷi is simply the mean, and varR(Yi − Ŷi), CVR(Yi − Ŷi) and I²R offer no improvement on var(Yi), CV(Yi) and I². As more explanatory variables are added, the gap between var(Yi) and varR(Yi − Ŷi), between CV(Yi) and CVR(Yi − Ŷi), and between I² and I²R grows. In all cases, including explanatory variables can help reduce the unexplained variation, to varying degrees. varR(Yi − Ŷi) and CVR(Yi − Ŷi) are greatly reduced, but I²R is not much lower than I². This is likely due to a previously discussed feature of I² (I²R): that it depends on the precision of the underlying estimates. With evaluations of CCT programs tending to have large sample sizes, the value of I² (I²R) is higher than it otherwise would be.

4.2.3 How Quickly Do Results Converge?

As more studies are completed, our ability to make predictions based on previous studies' results might improve. In the overall data set, results do not appear to converge or diverge over time. Figure 5 provides a scatter plot of the relationship between the absolute percent difference between a particular result and the mean result, on the one hand, and the chronological order of the paper relative to others on the same intervention-outcome, scaled to run from 0 to 1, on the other. For example, if there were 5 papers on a particular intervention-outcome combination, the first would take the value 0.2 and the last, 1. In this figure, attention is restricted to those percent differences less than 1000%. There is a weak positive relationship, indicating that earlier results tend to be closer to the mean result than later results, which are more variable, but this is not significant. Further, the relationship varies according to the cutoff used; Table 17 in Appendix C illustrates.

Figure 5: Variance of Results Over Time, Within Intervention-Outcome

However, it is still possible that if we can fit a model of the effect sizes to the data, as we did in the case of CCTs, the fit of the model could improve over time as more data are added. To test this, I run the previous OLS regressions of effect size on a constant and baseline enrollment rates using the data available at time period t, and measure the absolute error of the predicted values Ŷi generated by applying the estimated coefficients to the data from future time periods. I consider the prediction error at time period t + 1 and, separately, the mean absolute prediction error across all future time periods (t + 1, t + 2, ...) in alternative specifications. Results regressing the error on the number of papers used to generate the coefficients are shown in Table 11. Since multiple papers may have come out in the same year, there are necessarily discrete jumps in the number of results available at different time periods t, and results are bootstrapped. Overall, it appears that the fit can improve over time. The fit of model 2, in particular, improves over the first 30-60 studies and afterwards does not show much further reduction in error, though the fit of other models could take longer to converge. It is possible that leveraging within-paper heterogeneity could speed convergence; the next section explores the relationship between within-study and across-study heterogeneity.
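The exercise behind Table 11 can be sketched as an expanding-window evaluation (variable names hypothetical; the paper's version also bootstraps the estimates):

```python
import numpy as np

def expanding_window_errors(years, y, x):
    """For each period t, fit effect ~ a + b*x on studies available
    through t, then record the mean absolute prediction error for
    studies appearing after t. Illustrative sketch only."""
    years, y, x = map(np.asarray, (years, y, x))
    out = []
    for t in np.unique(years)[:-1]:
        past = years <= t
        X = np.column_stack([np.ones(past.sum()), x[past]])
        a, b = np.linalg.lstsq(X, y[past], rcond=None)[0]
        future = years > t
        errs = np.abs(y[future] - (a + b * x[future]))
        out.append((t, past.sum(), errs.mean()))
    return out  # (year, number of previous papers, mean absolute error)
```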
Table 11: Prediction Error from Mixed Models Declines As Evidence Accumulates

                                  (1) Model 1        (2) Model 1           (3) Model 2        (4) Model 2
                                  Absolute Error     Mean Absolute Error   Absolute Error     Mean Absolute Error
Number of Previous Papers (10s)    0.003 (0.00)      -0.001 (0.00)         -0.014*** (0.00)   -0.043*** (0.01)
Constant                           0.042** (0.02)     0.057*** (0.00)       0.120*** (0.02)    0.257*** (0.03)
Observations                       135                150                   111                150
R²                                 0.01               0.08                  0.08               0.42

Columns (1) and (3) focus on the absolute prediction error at time period t + 1 given the evidence at time t. Columns (2) and (4) focus on the mean absolute prediction error over all time periods t + 1, t + 2, ....

4.3 Predicting External Validity from a Single Paper

It would be very helpful if we could estimate the across-paper, within-intervention-outcome metrics using the results from individual papers. Many papers report results for different subgroups or over time, and the variation in results for a particular intervention-outcome within a single paper could be a plausible proxy for the variation in results for that same intervention-outcome across papers. If this relationship holds, it would help researchers estimate the external validity of their own study even when no other studies on the intervention have been completed. Table 12 shows the results of regressing the across-paper measures of var(Yi) and CV(Yi) on the average within-paper measures for the same intervention-outcome combination; a sketch of how the within-paper measures can be constructed follows.
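A minimal sketch of constructing the mean within-paper variance per intervention-outcome (field names hypothetical):

```python
from collections import defaultdict
from statistics import mean, variance

def mean_within_paper_variance(rows):
    """rows: (intervention, outcome, paper_id, effect_size) tuples.
    Returns the mean within-paper variance per intervention-outcome,
    using papers reporting two or more results. Sketch only."""
    by_paper = defaultdict(list)
    for intervention, outcome, paper, es in rows:
        by_paper[(intervention, outcome, paper)].append(es)
    by_io = defaultdict(list)
    for (intervention, outcome, _), es_list in by_paper.items():
        if len(es_list) >= 2:
            by_io[(intervention, outcome)].append(variance(es_list))
    return {io: mean(vs) for io, vs in by_io.items()}
```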
Table 12: Regression of Across-Paper Heterogeneity on Mean Within-Paper Heterogeneity

                              (1)                     (2)                (3)
                              Across-paper variance   Across-paper CV    Across-paper I²
Mean within-paper variance     0.343** (0.13)
Mean within-paper CV                                   0.000* (0.00)
Mean within-paper I²                                                      0.543*** (0.10)
Constant                       0.101* (0.06)           0.867 (0.63)       0.453*** (0.08)
Observations                   51                      50                 51
R²                             0.04                    0.00               0.31

The mean of each within-paper measure is created by calculating the measure within a paper, for each paper reporting two or more results on the same intervention-outcome combination, and then averaging that measure across papers within the intervention-outcome.

It appears that within-paper variation in results is indeed significantly correlated with across-paper variation in results. Authors could undoubtedly obtain even better estimates using micro data.

4.4 Robustness Checks

One may be concerned that low-quality papers are either inflating or depressing the degree of generalizability that is observed. There are infinitely many ways to measure paper "quality"; I consider two. First, I use the most widely used quality assessment measure, the Jadad scale (Jadad et al., 1996). The Jadad scale asks whether the study was randomized, whether it was double-blind, and whether there was a description of withdrawals and dropouts. A paper gets one point for each of these characteristics; in addition, a point is added if the method of randomization was appropriate and subtracted if it was inappropriate, and a point is similarly added if the blinding method was appropriate and subtracted if inappropriate. This results in a 0-5 point scale. Given that the kinds of interventions being tested are not typically well suited to blinding, I consider all those papers scoring at least a 3 to be "high quality".
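A sketch of the published scale as a scoring function (argument names are illustrative):

```python
def jadad_score(randomized, double_blind, withdrawals_described,
                randomization_appropriate=None, blinding_appropriate=None):
    """Jadad et al. (1996) quality score, 0-5: one point each for
    randomization, double-blinding, and describing withdrawals and
    dropouts; a point added or subtracted for an appropriate or
    inappropriate randomization or blinding method. Sketch only."""
    score = int(randomized) + int(double_blind) + int(withdrawals_described)
    if randomized and randomization_appropriate is not None:
        score += 1 if randomization_appropriate else -1
    if double_blind and blinding_appropriate is not None:
        score += 1 if blinding_appropriate else -1
    return max(0, min(5, score))
```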
In an alternative specification, I consider only those results from studies that were RCTs, for two reasons. First, many would consider RCTs to be higher-quality studies. Second, we might be concerned about how specification searching and publication bias could affect results. In a separate paper (Vivalt, 2015a), I discuss these issues at length and find relatively little evidence of these biases in the data, with RCTs exhibiting even fewer signs of specification searching and publication bias; results based only on those studies which were RCTs thus provide a good robustness check.

Tables 15 and 16 in the Appendix provide robustness checks using these two quality measures. Table 14 also includes the one observation previously dropped for having an effect size more than 2 SD away from 0. The heterogeneity measures are not substantially different using these data sets.

5 Conclusion

How much impact evaluation results generalize to other settings is an important topic, and data from meta-analyses are the ideal data with which to answer this question. With data on 20 different types of interventions, all collected in the same way, we can begin to speak a bit more generally about how results tend to vary across contexts and what that implies for impact evaluation design and policy recommendations.

I started by discussing heterogeneous treatment effects, defining generalizability, and relating generalizability to several possible measures. Each measure has its strengths and limitations, and multiple measures should be used to obtain a more complete view. I then discussed the rich data set on which the results are based and its formation. I presented results for each measure, first looking at the basic measures of variation and the proportion of variation that is systematic across intervention-outcome combinations, and then looking within a particular intervention-outcome: the effect of CCTs on enrollment rates.

Smaller studies tended to have larger effect sizes, which we might expect if smaller studies are better targeted, are selected for evaluation when there is a higher a priori expectation that they will have a large effect size, or if there is a preference to report larger effect sizes, which smaller studies would obtain more often by chance. Government-implemented programs also had smaller effect sizes than academic/NGO-implemented programs, even after controlling for sample size. This is unfortunate, given that we often do smaller impact evaluations with NGOs in the hope of finding a strong positive effect that can scale through government implementation.

In the case of the effect of CCTs on enrollment rates, the generalizability measures improve with the addition of an explanatory mixed model. I also found that the predictive ability of the model improved over time, estimating the model using sequentially larger cuts of the data (i.e. the evidence base at time t, t + 1, ...).
Finally, I compared within-paper heterogeneity in treatment effects to across-paper heterogeneity in treatment effects. Within-paper heterogeneity is present in my data because papers often report multiple results for the same outcomes, such as for different subgroups. Fortunately, I find that even these crude measures of within-paper heterogeneity predict across-paper heterogeneity for the relevant intervention-outcome. This implies that researchers can get a quick estimate of how well their results would apply to other settings simply by using their own data; with access to micro data, authors could do much richer analysis.

I also considered the robustness of these results to specification searching, publication bias (Vivalt, 2015a), and issues of paper quality. A companion paper finds that RCTs fare better than non-RCTs with respect to specification searching and publication bias, so I present results based on those studies which are RCTs, as well as separately restricting attention to those studies that meet a common quality standard.

I consider several ways to evaluate the magnitude of the variation in results. Whether results are too heterogeneous ultimately depends on the purpose for which they are being used; some policy decisions might have greater room for error than others. However, it is safe to say, looking at both the coefficient of variation and the I², which have commonly accepted benchmarks in other disciplines, that these impact evaluations exhibit more heterogeneity than is typical in other fields such as medicine, even after accounting for explanatory variables in the case of conditional cash transfers. Further, I find that under mild assumptions, the typical variance of results is such that a particular program's effect would be mis-predicted by more than 50% over 80% of the time.

There are some steps that researchers can take that may improve the generalizability of their own studies. First, just as with heterogeneous selection into treatment (Chassang, Padró i Miquel and Snowberg, 2012), one solution would be to ensure one's impact evaluation varied some of the contextual variables that we might think underlie the heterogeneous treatment effects. Given that many studies are underpowered as it is, that may not be likely; however, large organizations and governments have been supporting more impact evaluations, providing more opportunities to explicitly integrate these analyses. Efforts to coordinate across different studies, asking the same questions or looking at some of the same outcome variables, would also help. The framing of heterogeneous treatment effects could also provide positive motivation for replication projects in different contexts: different findings would not necessarily negate the earlier ones but would add another level of information.

In summary, generalizability is not binary but something that we can measure. This paper showed that past results have significant but limited ability to predict other results on the same topic, and that this was not apparently due to bias. Knowing how much results tend to extrapolate, and when, is critical if we are to know how to interpret an impact evaluation's
results or apply its findings. Given that other fields with less heterogeneity nonetheless have a better-developed practice of replication and meta-analysis, economics would seem to have much to gain by expanding in this direction.
References

AidGrade (2013). "AidGrade Process Description", http://www.aidgrade.org/methodology/processmap-and-methodology, March 9, 2013.

AidGrade (2015). "AidGrade Impact Evaluation Data, Version 1.2".

Alesina, Alberto and David Dollar (2000). "Who Gives Foreign Aid to Whom and Why?", Journal of Economic Growth, vol. 5 (1).

Allcott, Hunt (forthcoming). "Site Selection Bias in Program Evaluation", Quarterly Journal of Economics.

Bastardi, Anthony, Eric Luis Uhlmann and Lee Ross (2011). "Wishful Thinking: Belief, Desire, and the Motivated Evaluation of Scientific Evidence", Psychological Science.

Becker, Betsy Jane and Meng-Jia Wu (2007). "The Synthesis of Regression Slopes in Meta-Analysis", Statistical Science, vol. 22 (3).

Bold, Tessa et al. (2013). "Scaling-up What Works: Experimental Evidence on External Validity in Kenyan Education", working paper.

Borenstein, Michael et al. (2009). Introduction to Meta-Analysis. Wiley.

Boriah, Shyam et al. (2008). "Similarity Measures for Categorical Data: A Comparative Evaluation", in Proceedings of the Eighth SIAM International Conference on Data Mining.

Brodeur, Abel et al. (2012). "Star Wars: The Empirics Strike Back", working paper.

Cartwright, Nancy (2007). Hunting Causes and Using Them: Approaches in Philosophy and Economics. Cambridge: Cambridge University Press.

Cartwright, Nancy (2010). "What Are Randomized Controlled Trials Good For?", Philosophical Studies, vol. 147 (1): 59-70.

Casey, Katherine, Rachel Glennerster and Edward Miguel (2012). "Reshaping Institutions: Evidence on Aid Impacts Using a Preanalysis Plan", Quarterly Journal of Economics, vol. 127 (4): 1755-1812.

Chassang, Sylvain, Gérard Padró i Miquel and Erik Snowberg (2012). "Selective Trials: A Principal-Agent Approach to Randomized Controlled Experiments", American Economic Review, vol. 102 (4): 1279-1309.

Cohen, Jacob (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Deaton, Angus (2010). "Instruments, Randomization, and Learning about Development", Journal of Economic Literature, vol. 48 (2): 424-55.

Duflo, Esther, Pascaline Dupas and Michael Kremer (2012). "School Governance, Teacher Incentives and Pupil-Teacher Ratios: Experimental Evidence from Kenyan Primary Schools", NBER Working Paper.

Evans, David and Anna Popova (2014). "Cost-effectiveness Measurement in Development: Accounting for Local Costs and Noisy Impacts", World Bank Policy Research Working Paper, No. 7027.

Falagas, Matthaios E. et al. (2008). "Comparison of PubMed, Scopus, Web of Science, and Google Scholar: Strengths and Weaknesses", FASEB Journal, vol. 22 (2): 338-342.

Ferguson, Christopher and Michael Brannick (2012). "Publication Bias in Psychological Science: Prevalence, Methods for Identifying and Controlling, and Implications for the Use of Meta-Analyses", Psychological Methods, vol. 17 (1): 120-128.

Franco, Annie, Neil Malhotra and Gabor Simonovits (2014). "Publication Bias in the Social Sciences: Unlocking the File Drawer", working paper.

Gelman, Andrew et al. (2013). Bayesian Data Analysis, Third Edition. Chapman and Hall/CRC.

Gerber, Alan and Neil Malhotra (2008a). "Do Statistical Reporting Standards Affect What Is Published? Publication Bias in Two Leading Political Science Journals", Quarterly Journal of Political Science, vol. 3.

Gerber, Alan and Neil Malhotra (2008b). "Publication Bias in Empirical Sociological Research: Do Arbitrary Significance Levels Distort Published Results?", Sociological Methods & Research, vol. 37 (3).

Glass, Gene V. (1976). "Primary, Secondary, and Meta-Analysis of Research", Educational Researcher, vol. 5 (10): 3-8.

Hedges, Larry and Therese Pigott (2004). "The Power of Statistical Tests for Moderators in Meta-Analysis", Psychological Methods, vol. 9 (4).

Higgins, Julian PT and Sally Green, eds. (2011). Cochrane Handbook for Systematic Reviews of Interventions, Version 5.1.0 [updated March 2011]. The Cochrane Collaboration. Available from www.cochrane-handbook.org.

Higgins, Julian PT et al. (2003). "Measuring Inconsistency in Meta-Analyses", BMJ, vol. 327: 557-60.

Higgins, Julian PT and Simon Thompson (2002). "Quantifying Heterogeneity in a Meta-Analysis", Statistics in Medicine, vol. 21: 1539-1558.

Hsiang, Solomon, Marshall Burke and Edward Miguel (2013). "Quantifying the Influence of Climate on Human Conflict", Science, vol. 341.

Independent Evaluation Group (2012). "World Bank Group Impact Evaluations: Relevance and Effectiveness", World Bank Group.

Jadad, A.R. et al. (1996). "Assessing the Quality of Reports of Randomized Clinical Trials: Is Blinding Necessary?", Controlled Clinical Trials, vol. 17 (1): 1-12.

Millennium Challenge Corporation (2009). "Key Elements of Evaluation at MCC", presentation, June 9, 2009.

Ng, CK (2014). "Inference on the Common Coefficient of Variation When Populations Are Lognormal: A Simulation-Based Approach", Journal of Statistics: Advances in Theory and Applications, vol. 11 (2).

Page, Matthew, Joanne McKenzie and Andrew Forbes (2013). "Many Scenarios Exist for Selective Inclusion and Reporting of Results in Randomized Trials and Systematic Reviews", Journal of Clinical Epidemiology, vol. 66 (5).

Pritchett, Lant and Justin Sandefur (2013). "Context Matters for Size: Why External Validity Claims and Development Practice Don't Mix", Center for Global Development Working Paper 336.

RePEc (2013). "RePEc h-index for journals", http://ideas.repec.org/top/top.journals.hindex.html.

Rodrik, Dani (2009). "The New Development Economics: We Shall Experiment, but How Shall We Learn?", in What Works in Development? Thinking Big and Thinking Small, ed. Jessica Cohen and William Easterly, 24-47. Washington, D.C.: Brookings Institution Press.

Saavedra, Juan and Sandra Garcia (2013). "Educational Impacts and Cost-Effectiveness of Conditional Cash Transfer Programs in Developing Countries: A Meta-Analysis", CESR Working Paper.

Simmons, Joseph, Leif Nelson and Uri Simonsohn (2011). "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant", Psychological Science, vol. 22.

Simonsohn, Uri et al. (2014). "P-Curve: A Key to the File Drawer", Journal of Experimental Psychology: General.

Tian, Lili (2005). "Inferences on the Common Coefficient of Variation", Statistics in Medicine, vol. 24: 2213-2220.

Tibshirani, Ryan and Robert Tibshirani (2009). "A Bias Correction for the Minimum Error Rate in Cross-Validation", Annals of Applied Statistics, vol. 3 (2).

Tierney, Michael J. et al. (2011). "More Dollars than Sense: Refining Our Knowledge of Development Finance Using AidData", World Development, vol. 39.

Tipton, Elizabeth (2013). "Improving Generalizations from Experiments Using Propensity Score Subclassification: Assumptions, Properties, and Contexts", Journal of Educational and Behavioral Statistics, vol. 38: 239-266.

USAID (2011). "Evaluation: Learning from Experience", USAID Evaluation Policy.

Vivalt, Eva (2015a). "The Trajectory of Specification Searching Across Disciplines and Methods", working paper.

Vivalt, Eva (2015b). "How Concerned Should We Be About Selection Bias, Hawthorne Effects and Retrospective Evaluations?", working paper.

Walsh, Michael et al. (2013). "The Statistical Significance of Randomized Controlled Trial Results Is Frequently Fragile: A Case for a Fragility Index", Journal of Clinical Epidemiology.
Appendices

A Guide to Appendices

A.1 Appendices in this Paper

B) Excerpt from AidGrade's Process Description (2013).
C) Additional results.
D) Derivation of mixed model.

A.2 Further Online Appendices

Because this paper describes data from twenty different meta-analyses and systematic reviews, I must rely in part on online appendices. The following are available at http://www.evavivalt.com/research:

E) The search terms and inclusion criteria for each topic.
F) The references for each topic.
G) The coding manual.
B Data Collection

B.1 Description of AidGrade's Methodology

The following details of AidGrade's data collection process draw heavily from AidGrade's Process Description (AidGrade, 2013).

Figure 6: Process Description

Stage 1: Topic Identification

AidGrade staff members were asked to each independently make a list of at least thirty international development programs that they considered to be the most interesting. The independent lists were appended into one document and duplicates were tagged and removed. Each of the remaining topics was discussed and refined to bring them all to a clear and narrow level of focus. Pilot searches were conducted to get a sense of how many impact evaluations there might be on each topic, and all the interventions for which the very basic pilot searches identified at least two impact evaluations were shortlisted. A random subset of the topics was selected, with the most popular topic in a public vote also included.

Stage 2: Search

Each search engine has its own peculiarities. In order to ensure that all relevant papers and few irrelevant papers were included, a set of simple searches was conducted on different potential search engines. First, initial searches were run on AgEcon; British Library for Development Studies (BLDS); EBSCO; Econlit; Econpapers; Google Scholar; IDEAS; JOLISPlus; JSTOR; Oxford Scholarship Online; Proquest; PubMed; ScienceDirect; SciVerse; SpringerLink; Social Science Research Network (SSRN); Wiley Online Library; and the World Bank eLibrary. The list of potential search engines was compiled broadly from those listed in other systematic reviews. The purpose of these initial searches was to obtain information about the scope and usability of the search engines, to determine which would be effective tools for identifying impact evaluations on different topics. External reviews of different search engines were also consulted, such as Falagas et al. (2008), which covers the advantages and differences between the Google Scholar, Scopus, Web of Science and PubMed search engines.

Second, searches were conducted for impact evaluations of two test topics: deworming and toilets. EBSCO, IDEAS, Google Scholar, JOLISPlus, JSTOR, Proquest, PubMed, ScienceDirect, SciVerse, SpringerLink, Wiley Online Library and the World Bank eLibrary were used for these searches. Nine search strings were tried for deworming and up to 33 strings for toilets, with modifications as needed for each search engine. For each search, the number of results, and the number of the first 10-50 results which appeared to be impact evaluations of the topic in question, were recorded. This gave a better sense of which search engines and which kinds of search strings would return both comprehensive and relevant results. A qualitative assessment of the search results was also provided for the Google Scholar and SciVerse searches. Finally, the online databases of J-PAL, IPA, CEGA and 3ie were searched. Since these databases are already narrowly focused on impact evaluations, attention was restricted to simple keyword searches, checking whether the search engines integrated with each database seemed to pull up relevant results for each topic. Ultimately, Google Scholar and the online databases of J-PAL, IPA, CEGA and 3ie, along with EBSCO/PubMed for health-related interventions, were selected for use in the full searches.
After the interventions of interest were identified, search strings were developed and tested using each search source. Each search string included methodology-specific stock keywords that narrowed the search to impact evaluation studies, except for the search strings for the J-PAL, IPA, CEGA and 3ie searches, as these databases already focus exclusively on impact evaluations. Experimentation with keyword combinations in stages 1.4 and 2.1 was helpful in developing the search strings, which could take slightly different forms for different search engines. Search terms were tailored to the search source, and a full list is included in an appendix.

C# was used to write a script to scrape the results from search engines. The script was programmed to ensure that the Boolean logic of the search string was properly applied within the constraints of each search engine's capabilities. Some sources were specialized and could contain useful papers that do not turn up in simple searches; the papers listed on the J-PAL, IPA, CEGA and 3ie websites are a good example. For these sites, it made more sense for the papers to be manually searched and added to the relevant spreadsheets.

After the automated and manual searches were complete, duplicates were removed by matching on author and title names (a minimal sketch of this matching appears below). During the title screening stage, the consolidated list of citations yielded by the scraped searches was checked for any existing meta-analyses or systematic reviews, and any papers that these included were added to the list. With these references added, duplicates were again flagged and removed.

Stage 3: Screening

Generic and topic-specific screening criteria were developed. The generic screening criteria are detailed below, as is an example of a set of topic-specific screening criteria. The screening criteria were very inclusive overall. This is because AidGrade purposely follows a different approach to most meta-analyses, in the hope that the data collected can be re-used by researchers who want to focus on a different subset of papers. The motivation is that vast resources are typically devoted to a meta-analysis, but if another team of researchers thinks a different set of papers should be used, they will have to scour the literature and recreate the data from scratch. If the two groups disagree, all the public sees are their two sets of findings and their reasoning for selecting different papers. AidGrade instead strives to cover the superset of all impact evaluations one might wish to include, along with a list of their characteristics (e.g. where they were conducted, whether they were randomized by individual or by cluster, etc.), and to let people set their own filters on the papers or select individual papers and view the entire space of possible results.
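As a concrete illustration of the author-title duplicate matching described above (field names hypothetical; AidGrade's actual matching logic may differ):

```python
import re

def match_key(author, title):
    """Normalize author and title for duplicate matching: lowercase
    and strip punctuation and whitespace."""
    norm = lambda s: re.sub(r"[^a-z0-9]", "", s.lower())
    return (norm(author), norm(title))

def deduplicate(records):
    """records: iterable of dicts with 'author' and 'title' keys.
    Keeps the first occurrence of each (author, title) pair."""
    seen, unique = set(), []
    for r in records:
        k = match_key(r["author"], r["title"])
        if k not in seen:
            seen.add(k)
            unique.append(r)
    return unique
```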
Figure 7: Generic Screening Criteria

Category               Inclusion Criteria                             Exclusion Criteria
Methodologies          Impact evaluations that have counterfactuals   Observational studies; strictly qualitative studies
Publication status     Peer-reviewed or working paper                 N/A
Time period of study   Any                                            N/A
Location/Geography     Any                                            N/A
Quality                Any                                            N/A

Figure 8: Topic-Specific Criteria Example: Formal Banking

Category       Inclusion Criteria                                         Exclusion Criteria
Intervention   Formal banking services, specifically including:           Other formal banking
               expansion of credit and/or savings; provision of           services; microfinance
               technological innovations; introduction or expansion
               of financial education, or other programs to increase
               financial literacy or awareness
Outcomes       Individual and household income; small and                 N/A
               micro-business income; household and business assets;
               household consumption; small and micro-business
               investment; small, micro-business or agricultural
               output; measures of poverty; measures of well-being
               or stress; business ownership; any other outcome
               covered by multiple papers

Figure 11 illustrates the difference. For this reason, minimal screening was done during the screening stage. Instead, data were collected broadly, and re-screening was allowed at the point of doing the analysis. This is highly beneficial for the purposes of this paper, as it allows us to look at the largest possible set of papers and all subsets.

After the screening criteria were developed, two volunteers independently screened the titles to determine which papers in the spreadsheet were likely to meet the criteria developed in Stage 3.1. Any differences in coding were arbitrated by a third volunteer. All volunteers received training before beginning, based on the AidGrade Training Manual and a test set of entries. Volunteers' training inputs were screened to ensure that only proficient volunteers would be allowed to continue.
Figure 9: AidGrade's Strategy
Of those papers that passed the title screening, two volunteers independently determined whether the papers in the spreadsheet met the screening criteria developed in Stage 3.1, judging by the paper abstracts. Any differences in coding were again arbitrated by a third volunteer. The full text was then found for those papers which passed both the title and abstract checks. Any paper that proved not to be a relevant impact evaluation under the aforementioned criteria was discarded at this stage.

Stage 4: Coding

Two AidGrade members each independently used the data extraction form developed in Stage 4.1 to extract data from the papers that passed the screening in Stage 3. Any disputes were arbitrated by a third AidGrade member. These AidGrade members received much more training than those who screened the papers, reflecting the increased difficulty of their work, and also completed a test set of entries before being allowed to proceed. The data extraction form was organized into three sections: (1) general identifying information; (2) paper and study characteristics; and (3) results. Each section contained qualitative and quantitative variables that captured the characteristics and results of the study.

Stage 5: Analysis

A researcher who could specialize in determining which of the interventions and results were similar enough to be combined was assigned to each meta-analysis topic. If in doubt, researchers could consult the original papers. In general, researchers were encouraged to focus on all the outcome variables for which multiple papers had results. When a study had multiple treatment arms sharing the same control, researchers would check whether enough data was provided in the original paper to allow estimates to be combined before the meta-analysis was run. This is a best practice to avoid double-counting the control group; for details, see the Cochrane Handbook for Systematic Reviews of Interventions (2011). If a paper did not provide sufficient data for this, the researcher would decide which treatment arm to focus on. Data were then standardized within each topic to be more comparable before analysis (for example, units were converted). The subsequent steps of the meta-analysis process are irrelevant for the purposes of this paper.

It should be noted that the first set of ten topics followed a slightly different procedure for stages (1) and (2). Only one list of potential topics was created in Stage 1.1, so Stage 1.2 (Consolidation of Lists) was only vacuously followed. There was also no randomization after public voting (Stage 1.7) and no scripted scraping searches (Stage 2.3), as all searches were manually conducted using specific strings. A different search engine was
also used: SciVerse Hub, an aggregator that includes SciVerse Scopus, MEDLINE, PubMed Central, ArXiv.org, and many other databases of articles, books and presentations. The search strings for both rounds of meta-analysis, manual and scripted, are detailed in another appendix.
Table 13: Descriptive Statistics: Standardized Narrowly Defined Outcomes

Intervention | Outcome | # Neg sig papers | # Insig papers | # Pos sig papers | # Papers
Conditional cash transfers | Attendance rate | 0 | 6 | 9 | 15
Conditional cash transfers | Enrollment rate | 0 | 6 | 31 | 37
Conditional cash transfers | Height | 0 | 1 | 1 | 2
Conditional cash transfers | Height-for-age | 0 | 6 | 1 | 7
Conditional cash transfers | Labor force participation | 1 | 12 | 5 | 18
Conditional cash transfers | Probability unpaid work | 1 | 0 | 4 | 5
Conditional cash transfers | Test scores | 1 | 2 | 2 | 5
Conditional cash transfers | Unpaid labor | 0 | 2 | 3 | 5
Conditional cash transfers | Weight-for-age | 0 | 2 | 0 | 2
Conditional cash transfers | Weight-for-height | 0 | 1 | 1 | 2
HIV/AIDS Education | Pregnancy rate | 0 | 2 | 0 | 2
HIV/AIDS Education | Probability has multiple sex partners | 0 | 1 | 1 | 2
HIV/AIDS Education | Used contraceptives | 1 | 6 | 3 | 10
Unconditional cash transfers | Enrollment rate | 0 | 3 | 8 | 11
Unconditional cash transfers | Test scores | 0 | 1 | 1 | 2
Unconditional cash transfers | Weight-for-height | 0 | 2 | 0 | 2
Insecticide-treated bed nets | Malaria | 0 | 3 | 6 | 9
Contract teachers | Test scores | 0 | 1 | 2 | 3
Deworming | Attendance rate | 0 | 1 | 1 | 2
Deworming | Birthweight | 0 | 2 | 0 | 2
Deworming | Diarrhea incidence | 0 | 1 | 1 | 2
Deworming | Height | 3 | 10 | 4 | 17
Deworming | Height-for-age | 1 | 9 | 4 | 14
Deworming | Hemoglobin | 0 | 13 | 2 | 15
Deworming | Malformations | 0 | 2 | 0 | 2
Deworming | Mid-upper arm circumference | 2 | 0 | 5 | 7
Deworming | Test scores | 0 | 0 | 2 | 2
Deworming | Weight | 3 | 8 | 7 | 18
Deworming | Weight-for-age | 1 | 6 | 5 | 12
Deworming | Weight-for-height | 2 | 7 | 2 | 11
Financial literacy | Savings | 0 | 2 | 3 | 5
Improved stoves | Chest pain | 0 | 0 | 2 | 2
Improved stoves | Cough | 0 | 0 | 2 | 2
Improved stoves | Difficulty breathing | 0 | 0 | 2 | 2
Improved stoves | Excessive nasal secretion | 0 | 1 | 1 | 2
Irrigation | Consumption | 0 | 1 | 1 | 2
Irrigation | Total income | 0 | 1 | 1 | 2
Microfinance | Assets | 0 | 3 | 1 | 4
Microfinance | Consumption | 0 | 2 | 0 | 2
Microfinance | Profits | 1 | 3 | 1 | 5
Microfinance | Savings | 0 | 3 | 0 | 3
Microfinance | Total income | 0 | 3 | 2 | 5
Micro health insurance | Enrollment rate | 0 | 1 | 1 | 2
Micronutrient supplementation | Birthweight | 0 | 4 | 3 | 7
Micronutrient supplementation | Body mass index | 0 | 1 | 4 | 5
Micronutrient supplementation | Cough prevalence | 0 | 3 | 0 | 3
Micronutrient supplementation | Diarrhea incidence | 1 | 5 | 5 | 11
Micronutrient supplementation | Diarrhea prevalence | 0 | 5 | 1 | 6
Micronutrient supplementation | Fever incidence | 0 | 2 | 0 | 2
Micronutrient supplementation | Fever prevalence | 1 | 2 | 2 | 5
Micronutrient supplementation | Height | 3 | 22 | 7 | 32
Micronutrient supplementation | Height-for-age | 5 | 23 | 8 | 36
Micronutrient supplementation | Hemoglobin | 7 | 11 | 29 | 47
Micronutrient supplementation | Malaria | 0 | 2 | 0 | 2
Micronutrient supplementation | Mid-upper arm circumference | 2 | 9 | 7 | 18
Micronutrient supplementation | Mortality rate | 0 | 12 | 0 | 12
Micronutrient supplementation | Perinatal deaths | 1 | 5 | 0 | 6
Micronutrient supplementation | Prevalence of anemia | 0 | 6 | 9 | 15
Micronutrient supplementation | Stillbirths | 0 | 4 | 0 | 4
Micronutrient supplementation | Stunted | 0 | 5 | 0 | 5
Micronutrient supplementation | Test scores | 1 | 2 | 7 | 10
Micronutrient supplementation | Triceps skinfold measurement | 1 | 0 | 1 | 2
Micronutrient supplementation | Wasted | 0 | 2 | 0 | 2
Micronutrient supplementation | Weight | 4 | 19 | 13 | 36
Micronutrient supplementation | Weight-for-age | 1 | 23 | 10 | 34
Micronutrient supplementation | Weight-for-height | 0 | 18 | 8 | 26
Mobile phone-based reminders | Appointment attendance rate | 1 | 0 | 2 | 3
Mobile phone-based reminders | Treatment adherence | 1 | 3 | 1 | 5
Performance pay | Test scores | 0 | 2 | 1 | 3
Rural electrification | Enrollment rate | 0 | 1 | 2 | 3
Rural electrification | Study time | 0 | 1 | 2 | 3
Rural electrification | Total income | 0 | 2 | 0 | 2
Safe water storage | Diarrhea incidence | 0 | 1 | 1 | 2
Scholarships | Attendance rate | 0 | 1 | 1 | 2
Scholarships | Enrollment rate | 0 | 2 | 3 | 5
Scholarships | Test scores | 0 | 2 | 0 | 2
School meals | Enrollment rate | 0 | 3 | 0 | 3
School meals | Height-for-age | 0 | 2 | 0 | 2
School meals | Test scores | 0 | 2 | 1 | 3
Water treatment | Diarrhea incidence | 0 | 1 | 1 | 2
Water treatment | Diarrhea prevalence | 0 | 1 | 5 | 6
Women's empowerment programs | Savings | 0 | 1 | 1 | 2
Women's empowerment programs | Total income | 0 | 0 | 2 | 2
Average | | 0.6 | 4.2 | 3.2 | 7.9
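For readers who want to see how counts like those in Table 13 can be produced, the sketch below classifies each paper's standardized effect by sign and by significance at the 5 percent level. It assumes a two-sided z-test on each effect divided by its standard error; the paper's actual classification may instead rely on the significance levels reported in the original studies. The input data are hypothetical.

```python
def tally_significance(results, z_crit=1.96):
    """Tally papers into negative-significant, insignificant, and
    positive-significant bins based on a two-sided test against zero.
    `results` is a list of (effect, standard_error) pairs, one per paper."""
    neg_sig = insig = pos_sig = 0
    for effect, se in results:
        z = effect / se
        if z <= -z_crit:
            neg_sig += 1
        elif z >= z_crit:
            pos_sig += 1
        else:
            insig += 1
    return neg_sig, insig, pos_sig

# Hypothetical (effect, SE) pairs for one intervention-outcome combination
print(tally_significance([(0.10, 0.04), (0.02, 0.05), (-0.15, 0.06)]))
# -> (1, 1, 1)
```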
Table 14: Heterogeneity Measures for Effect Sizes Within Intervention-Outcomes, Including Outlier

Intervention | Outcome | var(Yi) | CV(Yi) | I²
Microfinance | Assets | 0.000 | 5.508 | 0.999
Rural Electrification | Enrollment rate | 0.001 | 0.129 | 0.993
Micronutrients | Cough prevalence | 0.001 | 1.648 | 0.829
Microfinance | Total income | 0.001 | 0.989 | 0.998
Microfinance | Savings | 0.002 | 1.773 | 0.922
Financial Literacy | Savings | 0.004 | 5.472 | 0.979
Microfinance | Profits | 0.005 | 5.448 | 0.519
Contract Teachers | Test scores | 0.005 | 0.403 | 0.998
Performance Pay | Test scores | 0.006 | 0.608 | 0.552
Micronutrients | Body mass index | 0.007 | 0.675 | 1.000
Conditional Cash Transfers | Unpaid labor | 0.009 | 0.918 | 0.836
Micronutrients | Weight-for-age | 0.009 | 1.941 | 0.663
Micronutrients | Weight-for-height | 0.010 | 2.148 | 0.416
Micronutrients | Birthweight | 0.010 | 0.981 | 0.997
Micronutrients | Height-for-age | 0.012 | 2.467 | 0.640
Conditional Cash Transfers | Test scores | 0.013 | 1.866 | 0.887
Deworming | Hemoglobin | 0.015 | 3.377 | 0.996
Micronutrients | Mid-upper arm circumference | 0.015 | 2.078 | 0.317
SMS Reminders | Treatment adherence | 0.022 | 1.672 | 0.050
Micronutrients | Height | 0.023 | 4.369 | 0.991
Micronutrients | Mortality rate | 0.025 | 2.880 | 0.698
Micronutrients | Stunted | 0.025 | 1.110 | 0.665
Bed Nets | Malaria | 0.029 | 0.497 | 1.000
Conditional Cash Transfers | Attendance rate | 0.030 | 0.523 | 0.362
Micronutrients | Weight | 0.034 | 2.705 | 0.708
HIV/AIDS Education | Used contraceptives | 0.037 | 3.044 | 0.867
Micronutrients | Perinatal deaths | 0.038 | 2.096 | 0.108
Deworming | Height | 0.049 | 2.310 | 0.995
Micronutrients | Test scores | 0.052 | 1.694 | 0.891
Conditional Cash Transfers | Height-for-age | 0.055 | 22.166 | 0.125
Conditional Cash Transfers | Enrollment rate | 0.056 | 1.287 | 1.000
Deworming | Weight-for-height | 0.072 | 3.129 | 0.910
Micronutrients | Stillbirths | 0.075 | 3.041 | 0.955
Micronutrients | Prevalence of anemia | 0.095 | 0.793 | 0.268
Deworming | Height-for-age | 0.098 | 1.978 | 0.944
Deworming | Weight-for-age | 0.107 | 2.287 | 0.993
Micronutrients | Diarrhea incidence | 0.109 | 3.300 | 0.663
Micronutrients | Diarrhea prevalence | 0.111 | 1.205 | 0.815
Micronutrients | Fever prevalence | 0.146 | 3.076 | 0.959
Deworming | Weight | 0.165 | 3.897 | 0.999
Micronutrients | Hemoglobin | 0.215 | 1.439 | 0.269
SMS Reminders | Appointment attendance rate | 0.224 | 2.908 | 0.913
Deworming | Mid-upper arm circumference | 0.439 | 1.773 | 1.000
Conditional Cash Transfers | Probability unpaid work | 0.609 | 6.415 | 1.000
Conditional Cash Transfers | Labor force participation | 0.789 | 2.972 | 0.461
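The three columns of Table 14 can be computed from each intervention-outcome's standardized effect sizes and standard errors. The sketch below uses one standard construction: the sample variance and coefficient of variation of the effect sizes, and Higgins' I² derived from Cochran's Q under a fixed-effect pooled mean. This is an illustration, not the paper's own code; the exact estimators used may differ (for example, in which mean the CV is taken around), and the example inputs are hypothetical.

```python
import numpy as np

def heterogeneity_measures(effects, ses):
    """Compute var(Yi), CV(Yi), and Higgins' I^2 for one set of
    standardized effect sizes and their standard errors."""
    y = np.asarray(effects, dtype=float)
    se = np.asarray(ses, dtype=float)
    w = 1.0 / se**2                          # inverse-variance weights
    y_bar = np.sum(w * y) / np.sum(w)        # fixed-effect pooled estimate
    Q = np.sum(w * (y - y_bar)**2)           # Cochran's Q statistic
    df = len(y) - 1
    i_sq = max(0.0, (Q - df) / Q)            # share of variation beyond sampling error
    var_y = np.var(y, ddof=1)                # sample variance of effect sizes
    cv_y = np.std(y, ddof=1) / abs(np.mean(y))  # coefficient of variation
    return var_y, cv_y, i_sq

# Hypothetical standardized effects and standard errors for one outcome
var_y, cv_y, i2 = heterogeneity_measures([0.05, 0.20, 0.35], [0.03, 0.04, 0.05])
print(round(var_y, 3), round(cv_y, 3), round(i2, 3))
```

Note that I² rises with study precision: tightly estimated studies with even modestly different effects (small standard errors relative to the spread in effects) can push I² close to 1, which is consistent with the near-unity values for several intervention-outcomes in the table.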