                  A Survey of Controlled Experiments
                       in Software Engineering
Dag I.K. Sjøberg, Member, IEEE Computer Society, Jo E. Hannay, Ove Hansen, Vigdis By Kampenes,
Amela Karahasanović, Nils-Kristian Liborg, and Anette C. Rekdal

       Abstract—The classical method for identifying cause-effect relationships is to conduct controlled experiments. This paper reports
       upon the present state of how controlled experiments in software engineering are conducted and the extent to which relevant
       information is reported. Among the 5,453 scientific articles published in 12 leading software engineering journals and conferences in
       the decade from 1993 to 2002, 103 articles (1.9 percent) reported controlled experiments in which individuals or teams performed one
       or more software engineering tasks. This survey quantitatively characterizes the topics of the experiments and their subjects (number
       of subjects, students versus professionals, recruitment, and rewards for participation), tasks (type of task, duration, and type and size
       of application) and environments (location, development tools). Furthermore, the survey reports on how internal and external validity is
       addressed and the extent to which experiments are replicated. The gathered data reflects the relevance of software engineering
       experiments to industrial practice and the scientific maturity of software engineering research.

       Index Terms—Controlled experiments, survey, research methodology, empirical software engineering.


1    INTRODUCTION

THERE is an increasing understanding in the software engineering community that empirical studies are needed to develop or improve processes, methods, and tools for software development and maintenance [6], [4], [35], [16], [43], [5], [32], [41], [50], [15]. An important category of empirical study is that of the controlled experiment, the conducting of which is the classical scientific method for identifying cause-effect relationships.

This paper reports on a survey that quantitatively characterizes the controlled experiments in software engineering published in nine journals and three conference proceedings in the decade from 1993 to 2002. The journals are ACM Transactions on Software Engineering and Methodology (TOSEM), Empirical Software Engineering (EMSE), IEEE Computer, IEEE Software, IEEE Transactions on Software Engineering (TSE), Information and Software Technology (IST), Journal of Systems and Software (JSS), Software Maintenance and Evolution (SME), and Software: Practice and Experience (SPE). The conferences are the International Conference on Software Engineering (ICSE), the IEEE International Symposium on Empirical Software Engineering (ISESE), and the IEEE International Symposium on Software Metrics (METRICS). The conference Empirical Assessment & Evaluation in Software Engineering (EASE) is partially included, in that 10 selected articles from EASE appear in special issues of JSS, EMSE, and IST. We consider the above journals to be leaders in software engineering. ICSE is the principal conference in software engineering, and ISESE, METRICS, and EASE are major venues in empirical software engineering that report a relatively high proportion of controlled software engineering experiments.

Research in empirical software engineering should aim to acquire general knowledge about which technology (process, method, technique, language, or tool) is useful for whom to conduct which (software engineering) tasks in which environments. Hence, this survey focuses on the kind of technology being studied in the experiments investigated (which reflects the topics of the experiments), the subjects that took part, the tasks they performed, the type of application systems on which these tasks were performed, and the environments in which the experiments were conducted. This survey also includes data on experiment replication and the extent to which internal and external validity is discussed.

The paper is organized as follows: Section 2 describes related work. Section 3 defines the research method for the survey. Section 4 reports the extent of controlled experiments, and Sections 5-10 report our main findings. Section 11 discusses threats to the validity of this survey. Section 12 summarizes.
. D.I.K. Sjøberg, J.E. Hannay, O. Hansen, V.B. Kampenes, and A. Karahasanović are with the Simula Research Laboratory, Department of Software Engineering, P.O. Box 134, NO-1325 Lysaker, Norway. E-mail: {dagsj, johannay, oveha, vigdis, amela}@simula.no.
. N.-K. Liborg is with BNP Paribas, Karl Johansgt. 7, NO-0154 Oslo, Norway. E-mail: nils@liborg.no.
. A.C. Rekdal is with Unified Consulting, Ruseløkkveien 14, NO-0251 Oslo, Norway. E-mail: acr@unified.no.

Manuscript received 6 Jan. 2005; revised 22 June 2005; accepted 27 June 2005; published online 15 Sept. 2005.
Recommended for acceptance by B. Littlewood.
For information on obtaining reprints of this article, please send e-mail to: tse@computer.org, and reference IEEECS Log Number TSE-0005-0105.

2   RELATED WORK

Table 1 summarizes the purpose, scope, and extent of four major surveys in addition to this survey. Tichy et al. [43] compare the amount of experimental work published in a few computer science journals and conference proceedings with the amount of experimental work
published in a journal on artificial neural networks and a journal on optical engineering. In total, 403 articles were surveyed and classified into five categories: formal theory, design and modeling, empirical work, hypothesis testing, and other. Zelkowitz and Wallace [49] propose a taxonomy of empirical studies in software engineering and report a survey in which 612 articles were classified within this taxonomy. Glass et al. [20] investigate 369 articles with respect to topic, research approach, research method, reference discipline, and level of analysis.

                                                          TABLE 1
                                    Surveys of Empirical Studies in Software Engineering

The above surveys give a comprehensive picture of the research methods used in software engineering. They differ in purpose, selection criteria, and taxonomies. Nevertheless, their results suggest the same conclusions: The majority of published articles in computer science and software engineering provide little or no empirical validation, and the proportion of controlled experiments is particularly low. The surveys propose means to increase the amount and quality of empirical studies.

The major difference between those surveys and ours is that they describe the extent and characteristics of various types of empirical study, while we provide an in-depth study of controlled experiments only. A comparison of those surveys and ours regarding the extent of controlled experiments is provided in Section 4.

In addition to the general surveys described above, there are several surveys within subdisciplines of software engineering, for example, object-oriented technology [14], testing techniques [28], and software effort estimation [25]. Furthermore, Shaw [38] categorizes the research reported in articles submitted to and accepted for ICSE 2002, and Zendler [51] reports a survey of 31 experiments with the aim of developing a preliminary theory about software engineering.

3   RESEARCH METHOD

This section describes the kind of experiments that are considered in this survey and the procedure for identifying and analyzing the relevant articles.

3.1  Controlled Experiments in Software Engineering
Shadish et al. [37] provide the following definitions:

     .  Experiment: A study in which an intervention is deliberately introduced to observe its effects.
     .  Randomized experiment: An experiment in which units are assigned to receive the treatment or an alternative condition by a random process such as the toss of a coin or a table of random numbers.
     .  Quasi-experiment: An experiment in which units are not assigned to conditions randomly.
     .  Correlation study: Usually synonymous with nonexperimental or observational study; a study that simply observes the size and direction of a relationship among variables.

To identify the effects of the deliberate intervention in an experiment, factors that may influence the outcome, in addition to the treatment, should also be controlled.1 This is the challenge of internal validity (see Section 10.1). Note that control is not an all-or-nothing condition; the degree of control varies on a continuous scale.

   1. Some definitions are very explicit on the aspect of control; for example, Zimney [52] defines a psychological experiment as an "objective observation of phenomena which are made to occur in a strictly controlled situation in which one or more factors are varied and the others are kept constant"; see the discussion of this definition in [10].
                                                          TABLE 2
                                        Aspects and Their Attributes for Data Extraction




Based on the definitions given above, we present an operational definition used for this survey. Since the term "experiment" is inconsistently used in the software engineering community (often used synonymously with empirical study), we use the term "controlled experiment":

   Controlled experiment in software engineering (operational definition): A randomized experiment or a quasi-experiment in which individuals or teams (the experimental units) conduct one or more software engineering tasks for the sake of comparing different populations, processes, methods, techniques, languages, or tools (the treatments).

We do not distinguish between randomized experiments and quasi-experiments in this survey, because both experimental designs are relevant to empirical software engineering experimentation. Random assignment of experimental units to treatments may not always be feasible, e.g., for logistic reasons. For example, one of the surveyed experiments used units formed from existing training groups in a company—random assignment would, in this case, have disturbed the training process.

We exclude several types of study that share certain characteristics with controlled experiments because, while these may be highly relevant for the field, they do not apply the deliberate intervention that is essential to controlled experiments. Thus, we exclude correlation studies, studies that are solely based on calculations on existing data (e.g., from data mining), and evaluations of simulated teams based on data for individuals. The last category falls outside our operational definition because the units are constructed after the run of the experiment.

Studies that use projects or companies as treatment groups, in which data is collected at several levels (treatment defined, but no experimental unit defined), are also excluded, because we consider these to be multiple case studies [47] (even though the authors might refer to them as experiments). Our survey focuses on articles that provide the main reporting of experiments. This excludes articles that, at the outset, would not provide sufficient data for our analyses (e.g., summaries of research programs). Moreover, usability experiments are not included, since we regard those as part of another discipline (human-computer interaction).

3.2  Identification of Articles that Report Controlled Experiments
In order to identify and extract controlled experiments, one researcher systematically read the titles and abstracts of 5,453 scientific articles published in the selected journals and conference proceedings for the period 1993-2002. Excluded from the search were editorials, prefaces, article summaries, interviews, news, reviews, correspondence, discussions, comments, readers' letters, and summaries of tutorials, workshops, panels, and poster sessions.

If it was unclear from the title or abstract whether a controlled experiment was described, the entire article was read by both the same researcher and another person in the project team. In the end, 103 articles were selected. Note that identifying the relevant articles is not straightforward, because the terminology in this area is confusing. For example, several authors claim that they describe experiments even though no treatment is applied in the study.

3.3  Analysis of the Articles
The survey data is stored in a relational database (MS SQL Server 2000).2 Some data is specific to an article, some is specific to an experiment, and some information concerns the combination of article and experiment. Moreover, an article might describe several experiments, and an experiment might be described in several articles, typically with a different focus in each article. Consequently, we defined a data model with the entities article, experiment, and article-experiment, each with a set of attributes relevant to our survey. In addition to the survey database, a catalog of all the articles in searchable pdf-format was generated. (About 3/4 of the articles were provided in searchable pdf-format by the journal publishers; the remaining 1/4 were OCR-scanned.)3 The articles were analyzed according to the six aspects listed in Table 2. Each aspect encompasses a set of attributes for data extraction.

   2. MS SQL Server 2000 is a registered trademark of Microsoft Corp.
   3. The survey database and catalog of articles may be provided upon request to the corresponding author and under the conditions of a signed agreement toward the use of the data.
                                                                     TABLE 3
                                                    Articles that Report Controlled Experiments




Six researchers analyzed the articles so that each aspect above was covered by at least two persons. After the initial analysis, the results were compared, and possible conflicts were resolved by reviewing the articles collectively a third time or by handing the article over to a third person. The main analysis tool was SAS.4

   4. SAS is a registered trademark of SAS Institute Inc.

4   EXTENT

Controlled experiments, as defined in Section 3.1, are reported in 103 (1.9 percent) of the 5,453 articles scanned for this survey; see Table 3. The 103 articles report a total of 113 controlled experiments. Twelve articles report more than one experiment, and four experiments are reported in several articles.

EMSE, ISESE, and METRICS, which focus specifically on empirical software engineering, report a higher proportion of controlled experiments than the other journals and the ICSE conference. The mean percentage of controlled experiments across years varies between 0.6 and 3.5, but we see no marked trend over the years. An overview of the trend for the individual journals/conferences is presented in the Appendix.

The surveys summarized in Table 1 also report extent. Tichy et al. have the study type definition with the broadest scope and report that 14 percent of the articles published in the specific software engineering journals TSE and TOPLAS (Transactions on Programming Languages and Systems) describe empirical work. In Glass et al., the authors classify 3 percent of the articles as laboratory experiments using human subjects and less than 1 percent as field experiments. According to the survey by Zelkowitz and Wallace, experiments defined as controlled methods are reported in 2.9 percent of the articles. Our survey finds the lowest percentage of articles (1.9 percent) that report controlled experiments. This might be because our study type definition is narrower than those of the other studies, or because our investigation spans more sources and years.

We rank institutions and scholars according to the number of experiments published (not their quality), but relative to their fractional representation on the article(s) that report the experiment. Glass and Chen [18] also ranked institutions and scholars, but according to publication in systems and software engineering, and they used a more complex ranking scheme for scholars.

In total, 207 scholars are involved in the experiments of our survey. Table 4 presents the top 20 ranked scholars. Due to the fractional distribution, the number of experiments in which a scholar has actually been involved is typically higher than the scores in Table 4. For instance, the top-ranked scholar, Giuseppe Visaggio, was involved in six experiments described in four papers authored by one to three scholars, resulting in a fractional score of 4.2 experiments. Among the 20 top-ranked scholars, three (Laitenberger, Roper, Wood) were involved in eight experiments, one was involved in seven, four in six, two in five, nine in four, and one was involved in three experiments.
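The paper states the fractional rule only by example, so the following reading is an assumption on our part: an experiment reported in an article with n authors contributes 1/n to each author's score. A sketch with hypothetical data (coauthors "X" and "Y" are invented) that reproduces a Visaggio-like score of roughly 4.2 from six experiments in four papers:

    from collections import defaultdict

    def fractional_scores(reporting_articles):
        # reporting_articles: (author list, number of experiments for
        # which the article is the main report) pairs.
        scores = defaultdict(float)
        for authors, num_experiments in reporting_articles:
            share = num_experiments / len(authors)
            for author in authors:
                scores[author] += share
        return dict(scores)

    articles = [
        (["Visaggio"], 1),            # sole-author paper, one experiment
        (["Visaggio"], 2),            # sole-author paper, two experiments
        (["Visaggio", "X"], 1),       # two authors share one experiment
        (["Visaggio", "X", "Y"], 2),  # three authors share two experiments
    ]
    print(fractional_scores(articles)["Visaggio"])  # 1 + 2 + 1/2 + 2/3 = 4.17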


                                                          TABLE 4
                     Top 20 Scholars Conducting Controlled Experiments in Software Engineering 1993-2002

There are 109 institutions from 19 countries involved in the experiments of our survey. The scores for institutions are accumulated from the scores of affiliated authors. Table 5 presents the top 10 ranked institutions.

                                                           TABLE 5
                    Top 10 Institutions Conducting Controlled Experiments in Software Engineering 1993-2002

The institution that has used the most professionals as subjects throughout the surveyed time period is the Fraunhofer Institute, Kaiserslautern. In total, it used 100 professionals in six experiments, ranging from 11 to 20 in a single experiment. The institution that conducted the experiment involving the largest number (68) of professionals as subjects was Andersen Consulting (now Accenture), Norway.

5   TOPICS

This section describes two classifications of the 103 analyzed articles according to their main topic. The first classification illustrates the experiments' discipline coverage relative to

software engineering as a whole, while the second classification has a more technical focus on software engineering method and methodology. The analysis is with respect to article, rather than experiment; this is adequate since no two experiments on different topics are reported in the same article. Both classifications emphasize the treatment of an experiment, since the treatment, being the intervention of interest (Section 3), indicates the de facto topic under investigation.

5.1  Classification Scheme: Glass et al.
There are a number of classification schemes for computing science and software engineering, e.g., SWEBOK [1] and Glass et al. [20]. The classification scheme of Glass et al. is aimed at positioning software engineering research relative to a backdrop of the overall computing disciplines, i.e., computer science, software engineering, and information systems, and its classification categories are meant to give uniformity across all three fields [19]. The scheme is, therefore, somewhat general. On the other hand, this scheme has actually been used in classifying work undertaken in software engineering and can therefore be used for illustrating the relative topic coverage of controlled experiments in software engineering.

Fig. 1 shows the distribution to topic categories of controlled experiments in software engineering relative to software engineering research in general. Controlled experiments seem to cover at least the categories that are well represented in general SE research, but remember that the overall number of controlled experiments performed is low (Section 2). Recall that experiments on topics purely within human-computer interaction are not included in this survey, as is the case for topics purely within information systems. Our focus on experiments with human subjects also excludes a range of software engineering topics.
Fig. 1. Comparative distribution to topic of software engineering research and software engineering controlled experiments using the scheme of
Glass et al. Only nonvoid categories are shown.

The two most prominent categories are Software life-cycle/engineering (49 percent) and Methods/Techniques (32 percent), due to, respectively, the relatively large numbers of experiments on inspection techniques and on object-oriented design techniques.

5.2  Classification Scheme: IEEE Keyword Taxonomy
The IEEE Keyword Taxonomy [24] provides a more technical perspective than the scheme of Glass et al. [20]. This taxonomy is recommended for authors of IEEE articles and is an extended version of the ACM Computing Classification [2].

We use the IEEE keywords to denote topic categories. The classification according to the IEEE Keyword Taxonomy is given in Table 6. The two prominent technical areas are Code inspections and walkthroughs (35 percent) and Object-oriented design methods (8 percent). The numbers of experiments are limited for the other areas.

                                                               TABLE 6
                                      Classification of Articles According to IEEE Taxonomy

6   SUBJECTS

This section describes the kinds of subjects that take part in the experiments, the kind of information that is reported about them, and how they are recruited.

6.1  Number and Categories of Subjects in the Experiments
In total, 5,488 subjects took part in the 113 experiments investigated in this survey. Eighty-seven percent were students and nine percent were professionals. The reported subject types are divided into the categories given in Table 7.

                                                          TABLE 7
                                                      Subject Categories

The number of participants per experiment ranges from four to 266, with a mean value of 48.6 (Table 8). Students participated in 91 (81 percent) of the experiments, either alone or together with professionals and/or scientists, and professionals took part in 27 experiments (24 percent). The use of professionals as subjects has been relatively stable over time. Undergraduates are used much more often than graduate students. For one experiment, no information about subject type was provided; for eight experiments, no details about the type of students were given; and for five experiments with mixed types of subjects, no information about the number in each category was provided.

                                                          TABLE 8
                                                 Participants in Experiments

The issue of differences between students and professionals has been discussed in the literature [13], [12], [48], [36]. Interestingly, while seven articles describe experiments using both students and professionals, only three of them measure the difference in performance between the two groups. In the first experiment, categorized as Software psychology, three programming tasks were performed. For two of the tasks, there was no difference between the groups, whereas, for the third task, the professionals were significantly better. In the second experiment, also in Software psychology, there was no difference. In the third experiment, categorized as Maintenance process, the professionals were significantly better.

The performance in an experiment may differ between subcategories of subjects; that is, there may be an interaction effect between treatment and subcategory [3]. However, none of the surveyed experiments distinguished between subcategories of professionals or students.

The experiment with the highest number of professionals (68) was classified as Cost estimation in the IEEE taxonomy
(Table 6). Then, there were five experiments with 29-35 professionals, of which one also employed 20 students. These five were categorized, in descending order of number of subjects, as Modules and interfaces, Code inspections and walkthroughs, Maintenance process, Software psychology (understanding code), and Patterns.

The total number of participants was reported in all the articles, either explicitly or implicitly; in the latter case, we could roughly calculate the number (for instance, from the information that 10 teams averaging four subjects participated). Subject mortality (drop-outs) was reported in 24 experiments (2 percent mortality on average). Even in experiments with as many as 266 subjects (as well as in many other experiments with a relatively high number of subjects), no mortality was reported. One article states that "Non-random drop-out of subjects has been avoided by the experimental design, i.e. assignment of groups only on the second day of the experiment, i.e., directly before the treatment, and not before the pre-test already on the first day of the experiment." However, most articles say nothing about how mortality was managed.

There are good reasons for conducting experiments with students as subjects, for example, for testing experimental design and initial hypotheses, or for educational purposes [42]. Depending on the actual experiment, students may also be representative of junior/inexperienced professionals. However, the low proportion of professionals used in software engineering experiments reduces experimental realism, which in turn may inhibit the understanding of industrial software processes and, consequently, technology
transfer from the research community to industry. Hence, to break the trend of having few professionals as subjects, new strategies are needed to overcome these challenges; see, e.g., the discussions in [39], [40].

6.2  Information about Subjects
In order to generalize from an experiment with a given group of subjects (the sample population), one needs information about various characteristics and their variation, both in the sample and in the group to which the results will be generalized (the target population) [7]. For professionals, depending on what we wish to study, it would be relevant to know the variations regarding competence, productivity, education, experience (including domains), age, culture, etc. However, there is no generally accepted set of background variables for guiding data collection in a given type of study, simply because the software engineering community does not know which variables are the important ones. We have chosen to focus on the variables that are reported in the analyzed articles, that is, gender, age, education, experience, and task-related training.

The analyzed articles vary to a large extent in how they report such information. For 14 of the 113 experiments, no information about the subjects was reported. Moreover, the level of detail reported varies substantially. An example of detailed information on programming experience is: "On average, subjects' previous programming experience was 7.5 years, using 4.6 different programming languages with a largest program of 3510 LOC. Before the course, 69 percent of the subjects had some previous experience with object-oriented programming, 58 percent with programming GUIs." An example of a high-level description without figures is: "Some of the students had industrial programming experience."

For the 91 experiments with students, the following information was reported: gender (seven experiments), age (six experiments), grades (six experiments), programming experience (general description: 17 experiments; number of years/languages: 11 experiments), work experience in industry (general description: nine experiments; number of years: nine experiments), task-related experience (64 experiments), and task-related training (27 experiments). The training was either tailored specifically for the experiment or was part of a course, or the experiment could be conducted as part of a training session.

For the 27 experiments with professionals, more details on the subjects' background were given. Categories of professionals, such as reviewers, analysts, programmers, and managers, were given for seven experiments.


                                                          TABLE 8
                                                 Participants in Experiments
SJØBERG ET AL.: A SURVEY OF CONTROLLED EXPERIMENTS IN SOFTWARE ENGINEERING                                                                 741


Subjects' degrees were described for three experiments. Gender and age were given for, respectively, two and three experiments. Language and nationality were given for one experiment (subjects from two countries participated). A general description of programming experience was given for two experiments. Programming experience in years/languages was given for seven experiments. A self-assessment of programming experience was reported for two experiments. Work experience in years was given for five experiments. A general description of task-related experience was reported in one experiment. Task-related experience was measured in years for 13 experiments. Task-related training was reported for 12 experiments.

The relatively low and arbitrary reporting on context variables is a hindrance for metastudies, which are needed to identify which context factors influence which kinds of performance. The impact of the various context factors will, of course, depend on the treatments and the actual tasks to be performed in the experiments. Future work should investigate the extent to which the variation in the performance of subjects can be explained by their background, such as education and work experience, and should increase our knowledge of the impact of using students versus professionals as subjects in software engineering experiments.

6.3  Recruitment of Subjects
Recruiting subjects to experiments is not a trivial task, from either a methodological or a practical point of view. For example, volunteers may bias the results because they are often more motivated, skilled, etc., than subjects who take part because it is mandatory in some way [8]. Information about whether participation was mandatory is reported for 41 (36 percent) of the experiments. For 12 of them (all student experiments), participation was mandatory. Information about subject compensation for taking part in experiments is reported for 39 (35 percent) of the experiments. The grades of students were affected by participation in 10 cases, and they received extra credits in nine cases (Table 9). In three cases, students were paid to take part and, in one case, students were sponsored for a trip to an exhibition. No compensation to professionals is reported. Typically, the experiments with professionals were organized as part of normal projects or training programs, and payment was thus implicitly provided by the employer. Hence, it seems that none of the researchers or research teams paid companies or professionals for taking part in experiments.

                          TABLE 9
                     Subject Reward Data

If one applies statistical hypothesis testing, a requirement is to have a well-defined population from which the sample is drawn: "If you cannot define the population from which your subjects/objects are drawn, it is not possible to draw any inference from the results of your experiment" [30].5 Nevertheless, only one of the experiments in this survey that apply statistical hypothesis testing actually reported sampling from a well-defined target population.

   5. This claim should, as we understand it, not be interpreted outside the context of statistical hypothesis testing. Obviously, even a study without a well-defined population (but with a well-defined sample) may enable the researcher to infer about similar projects, e.g., based on argumentation by analogy or by theory; see further discussion in [26].

For only a couple of experiments was random sampling of subjects claimed, and how the random sampling was carried out was not described. The dominant approach was convenience sampling: "Subjects are selected because of their convenient accessibility to the researcher. These subjects are chosen simply because they are the easiest to obtain for the study. This technique is easy, fast and usually the least expensive and troublesome. ... The criticism of this technique is that bias is introduced into the sample." [34]. This does not mean that convenience sampling is generally inappropriate. For example, Ferber [17] refers to the exploratory, the illustrative, and the clinical situations in which convenience sampling may be appropriate. In software engineering, the most convenient way of recruiting subjects is to use the students who are taught by the researcher. (Note that convenience sampling is also common in other disciplines, such as clinical medicine [34] and the social sciences [33].)

To increase the potential for sampling subjects from a well-defined population and to alleviate the problem of having few professionals as subjects (Section 6.1), the experimental software engineering community should apply new incentives, for example, paying companies directly for the hours spent on an experiment [3] or offering companies tailored, internal courses whose exercises can be used in experiments [27]. Payment would require that researchers include expenses for this kind of experiment in their applications to funding bodies; see further discussion in [39].

7   TASKS

The tasks that subjects are asked to carry out are an important characteristic of a software engineering experiment. Tasks may include building a software application from scratch or performing various operations on an existing application. This section reports on the surveyed experiments according to a high-level categorization of their tasks and the duration of those tasks. Moreover, we describe the total magnitude of the experiments by reporting the product of the number of subjects and the duration of the tasks. Finally, we describe the kinds and sizes of the applications and materials used in the experiments.

7.1  Task Categorization
We categorize the tasks given to subjects according to the main tasks in a software process. We have defined four general categories, Plan, Create, Modify, and Analyze, that reflect major tasks on software artefacts. Table 10 shows subcategories within these major categories that have been identified in the surveyed experiments.
                                                          TABLE 10
                                       Task Categorization, Duration, and Materials Size




Task categorization is somewhat different from topic categorization. Tasks of a certain category can be used to test hypotheses within various topics. For example, a maintenance task can be used to test a certain design, or an experiment assigned to the Patterns category in the IEEE taxonomy might have design, coding, or maintenance tasks.

Table 10 shows the number of experiments deploying each kind of task. Note that tasks of several categories might be involved in a single experiment. A task is represented by its fraction of all the tasks in an experiment; for example, an experiment with one Design task and one Coding task gives a contribution of 0.5 to each of the two task categories. (Note also that we do not distinguish between tasks and subtasks, because there is no commonly agreed definition of the unit of task. Hence, in the relatively few cases in which the experimenters have divided their tasks into subtasks, we have considered them as one task as long as they fall within the same category.) Due to experiment design, a task may be performed several times by the same subjects but with different treatments. In such cases, however, the task is counted only once.
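Read operationally, this counting rule distributes one unit of weight per experiment across its task categories. The sketch below assumes that repeated tasks and same-category subtasks have already been collapsed into single entries, as the parenthetical remarks above prescribe; the category names are merely examples.

    from collections import Counter

    def category_weights(experiments):
        # experiments: one list of task categories per experiment, with
        # a task repeated under different treatments, and subtasks
        # within one category, already collapsed into a single entry.
        totals = Counter()
        for tasks in experiments:
            for category in tasks:
                totals[category] += 1 / len(tasks)
        return totals

    print(category_weights([["Design", "Coding"]]))  # 0.5 to each category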
The proportion of planning, creation, modification, and analysis tasks is, respectively, 10 percent, 20 percent, 16 percent, and 54 percent. Inspection tasks occur in 37 (33 percent) of the experiments and are by far the most prominent. This is in accordance with the topic classification of articles reported in Section 5. Thirty-six of these experiments involve individual inspections and 29 involve team inspections; 28 experiments involve both individual and team inspections. Inspection tasks are typically conducted using pen and paper, although some use support tools.

Document comprehension tasks form the basis of various software engineering tasks and typically involve answering questions about system structure and functionality. Twenty-three experiments involve document comprehension tasks; 12 of these pertain to code documents, nine are design comprehension, one concerns a requirements document, and one concerns process components.

Maintenance tasks pertain to all document types. The surveyed experiments, however, deal only with design and code maintenance tasks: 19 experiments give code maintenance tasks and three give change tasks on design documents (including impact analyses); none give both. In 10 of the code maintenance tasks, new functionality was added. One of the code maintenance tasks is a pen-and-paper maintenance task performed jointly with a comprehension task.

Coding and Design are tasks in which new artefacts are produced. Modifying existing code or design documents is classified as maintenance.

Most of the Testing tasks involve the generation of test harnesses and test cases. Testing here also includes debugging using debugging tools, but excludes inspections.
Three experiments investigate the effects of preplanning estimates on detailed estimates (anchoring). In one of these, the Estimation task is part of a Project planning exercise. One experiment involves estimation in a larger project, although project planning as such is not a task in the experiment in question. Two experiments issue estimation tasks in order to compare various estimation techniques.

Four of the five experiments with Project planning tasks are all-student experiments in which the subjects were asked to role-play in project planning or to simulate projects. The fifth involves both professionals and students assessing how 10 different factors affect the lead-time of software development projects.

In the experiment involving Requirements analysis, the subjects were asked to negotiate software requirements in a meeting with a customer.

Forty experiments deploy tasks in several categories. Among these experiments, five involve three tasks and two involve four tasks: one has comprehension, maintenance, and inspection (individual and team) tasks, and one has design, coding, team inspection, and testing tasks.

7.2  Task Duration
An important task characteristic is duration. Accurate duration data per subject (typically in dependent variables) is reported in 41 (36 percent) of the experiments, and duration data at slot level is reported in 28 (25 percent) of the experiments. (Time slots are coarse-grained indications, typically upper bounds, of how much time the subjects took to perform a task. For example, we chose a slot of two hours from the information that "We gave each subject up to three hours to review each document (i.e., one document in the morning, and one in the afternoon). Only one subject took more than two hours.") Duration data that is not considered sufficient for analysis is contained in phrases like "Six days, no time limit," "From 1 to 11 days depending on the subjects' skills," and "Non-programming subjects had 30 min. to finish their task. Programming subjects had one week."

Fig. 2 shows the frequency by time interval of the 41 experiments with detailed subject-level time information. It appears that about 2/3 of those experiments last less than two hours.

Fig. 2. Distribution of experiments with subject-level duration data to time intervals.

The two leftmost "Subject level" columns of Table 10 show, for each task category, respectively, the fraction of, and the number of, experiments with subject-level duration data that include tasks of this category. The third "Subject level" column shows the median duration in hours for these tasks. For example, three experiments have duration data for design tasks. The fraction of the time spent on design activities in these three experiments, relative to the total time for all experiments (with subject-level time data), is 2.3. The median time used on design tasks in these experiments is 0.85 hours. The median duration of all experiments with subject-level data is 1.0 hours; it is 2.0 hours for the experiments with slot-level data.

Fig. 3 shows the actual task duration for the subject-level occurrences of Table 10. In the interest of saving space, four data points, at 25 (Change code), 18.5 (Coding), 18.5 (Design), and 55 hours (Coding), respectively, are omitted from the figure. It appears that there is large variance in duration and that the variance seems independent of the type of task being performed.

Fig. 3. Task categories and subject-level duration data.

figure. It appears that there is large variance in duration and       We here categorize the 69 experiments with duration data
that it seems independent of the type of task being                (41 with subject-level data and 28 with slot-level data),
performed.                                                         according to subject numbers and task duration into,
   Little is mentioned in the articles about control over          respectively, S (small), M (medium), and L (large), such
context variables in experiments with multiple sessions, idle      that each category contains roughly 1/3 of the experiments.
periods, or that span several days or weeks. Although the issue in principle concerns all experiments, it would be particularly interesting to know how experimenters have ensured control in experiments that involve long tasks.

The data described above reflects the duration of the explicitly measured, software engineering-specific tasks described in Section 7.1. Often, however, subjects perform additional tasks (training, preparation, postmortem questionnaires, etc.) whose durations are neither captured in dependent variables nor otherwise measured explicitly. If one wants to reflect the total time spent by subjects (perhaps in the interest of logistics), information at a different level must be gathered. Although most experiments (close to 80 percent) provide some sort of information about total experiment duration, the data is, in general, measured and reported arbitrarily and is consequently difficult to summarize here.

The median task duration of 1.0/2.0 hours is, of course, a very small fraction of the time of a typical industrial development project. The extent to which short tasks are a threat to external validity is difficult to judge in general. The actual tasks in the experiments may be representative of typical industrial (sub)tasks. However, the lack of studies that describe "typical" tasks within certain categories and contexts makes such a judgment difficult. More studies are needed to investigate the relevance of the tasks conducted in software engineering experiments.

7.3 Magnitude of Experiments—Combination of Number of Subjects and Duration

Many aspects of the complexity of software engineering only manifest themselves in controlled experiments if the experiments involve a sufficiently large number of subjects and tasks, for example, differences among subgroups of subjects [3]. Hence, we can characterize the experiments in terms of the scale of the combination of subjects and tasks (here measured in terms of the duration of the task). The magnitude of an experiment can be described in terms of the total number of person-hours or person-days; that is, the number of subjects multiplied by the length of the tasks.

In this survey, the experiment with the largest number of professionals lasts less than one hour. However, in general, there seems to be no significant relationship between duration and the number of subjects. We therefore categorized the number of subjects and the task duration separately, each into three groups of approximately equal size. In practice, this gives the following categories: for subject numbers, S: < 23, M: 23-47, and L: > 47; for duration, S: < 0.96 hours, M: 0.96-2.9 hours, and L: > 2.9 hours. (The subject groups cannot be made completely even because there are six experiments with 24 subjects.) The person-hours categorization is obtained by crossing these two categorizations into configurations (subject category, duration category) as follows: S (small): (S,S), (S,M), (M,S); M (medium): (S,L), (M,M), (L,S); L (large): (M,L), (L,L), (L,M). Table 11 shows that experiments with professionals use a smaller number of subjects than do experiments with students. Both experiments with students and experiments with professionals are distributed uniformly across the three levels of duration. Regarding magnitude, most student experiments are in the middle category and a fair number are large, while most experiments with professionals are small and only one experiment is large.
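To make the crossing above concrete, the following sketch classifies an experiment's magnitude from its number of subjects and task duration. The sketch is ours, not the authors': the function names are invented, and the cutoffs simply restate the category boundaries given above.

    def subject_category(n_subjects):
        # S: fewer than 23 subjects, M: 23-47, L: more than 47
        if n_subjects < 23:
            return "S"
        return "M" if n_subjects <= 47 else "L"

    def duration_category(hours):
        # S: under 0.96 hours, M: 0.96-2.9, L: over 2.9
        if hours < 0.96:
            return "S"
        return "M" if hours <= 2.9 else "L"

    def magnitude(n_subjects, hours):
        # Cross the two categorizations into the survey's S/M/L scheme.
        combo = (subject_category(n_subjects), duration_category(hours))
        if combo in {("S", "S"), ("S", "M"), ("M", "S")}:
            return "S"  # small
        if combo in {("S", "L"), ("M", "M"), ("L", "S")}:
            return "M"  # medium
        return "L"      # large: (M,L), (L,L), (L,M)

    # Example: 30 subjects working for 1.5 hours -> (M, M) -> medium magnitude,
    # i.e., 45 person-hours in the sense of subjects multiplied by duration.
    print(magnitude(30, 1.5))  # M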
7.4 Application and Materials

TABLE 12
Distribution of Experiments to Application Type

Applications may be of various types, such as commercial, open source, the result of a student project, or custom-built for the purpose of the experiment. Table 12 shows that 75 percent of the surveyed experiments involved applications that were either constructed for the purpose of the experiment or were parts of student projects. Commercial applications were used in 16 experiments, of which 10 included inspection tasks (eight of these had team inspections in addition to individual inspections), two included design tasks, one had coding and maintenance (change code), one had coding only, one had (design) comprehension and maintenance (change design), and one had estimation. For 12 experiments, the reporting is unclear in this respect, but 11 of these appear to have used custom-built applications. There are no open-source applications registered in this survey. The small fraction of commercial or industrial applications used in current software engineering experiments calls into question the possibility of generalizing the experimental results to an industrial setting.

The size of the materials presented to subjects gives some indication of the comprehensiveness of the experiments. Size in the form of pages, lines of code (LOC), or other quantities is reported in 67 (59 percent) of the experiments. The diversity of the surveyed experiments, and of how they report information about materials, makes it difficult to give a systematic overview of the size of the experiment materials. Nevertheless, below, we describe in brief the size of materials per task category, cf. the rightmost column of Table 10.

Three experiments with Project planning tasks report materials size: a one-page case scenario, a scenario in terms of 500 adjusted function points, and a four-page program representation, respectively. The three Estimation experiments are based on a 1,000 person-hour project, a four-page program representation, and the creation of 10 programs (no sizes given for these), respectively.

Five of the six experiments with size indications for Design tasks are based on requirements documents (1-2 pages, six modules). The sixth experiment gives a one-page task description. Materials size is reported in one instance for a coding task (a specification given in three tables).

Two experiments with Change design tasks have a 30-page design document as material (one experiment is an internal replication of the other; the materials of the former are improved, but their sizes are more or less the same), and one has a 1,500 LOC system as input. In the experiments with Change code tasks, the applications to be maintained range from 54 to 2,700 LOC. The largest application also involves 100 pages of documentation. Three experiments report the number of classes (6-10).

Twenty-eight experiments give materials size for Inspection tasks (individual or team). Fourteen give LOC (ranging from 135 to 3,955 LOC). In one instance, the materials size is given as 300 LOC, but the size of the entire system is 65,000 LOC. Page counts (ranging from 16 to 47 pages) are given in 14 instances (all different from the 14 with LOC). Materials size for Testing tasks (25-2,000 LOC) is reported in seven experiments (one also reports 10 classes). Reported materials sizes for Document comprehension tasks are varied (five diagrams, seven screens, 16 screenshots, etc.), but five experiments give LOC (92-2,700 LOC) for Code comprehension tasks and five experiments give page counts (2-30 pages) for Design comprehension tasks.

In addition, some experiments (with tasks in the Create category) report the size of produced task solutions. Five experiments with Coding give LOC (in the range 86-2,000 LOC) for produced code, and in one experiment the size for a Design task is provided implicitly, in that the solution design document written by the experimenters is two pages. Also, the amount of added code is given for two maintenance tasks: 50-150 LOC and 35-79 LOC, respectively.

8 ENVIRONMENTS

The strength of controlled experiments lies in that they may be used to isolate causal relationships. However, controlled experiments in the field of software engineering are often conducted in artificially designed environments that make it difficult to generalize the results to industrial contexts. In short, "Internal and external validity can be negatively related" [37]. This section describes the surveyed experiments according to their location and the tools used.

8.1 Location

There is a trade-off between realism and control regarding the location of an experiment. Running an experiment in the usual office environment of subjects who are professionals allows a certain amount of realism, yet increases the threat to internal validity due to breaks, phone calls, and other interruptions. Controlling and monitoring the experiment is easier in a laboratory setup but, in such a setting, realism suffers.

For the 27 experiments with professionals or with professionals and students, 17 report no explicit information about the experimental setting. Only one experiment is reported to have been run in a usual office environment. The pilot of another experiment was run in an office environment, but the main experiment was run in a classroom setting in order to increase the internal validity. Three other experiments were run in a classroom setting, two of which were run as part of a training course. Seven experiments are explicitly reported to have been run in a laboratory environment.

Of the 85 experiments with students or with students and scientists, 56 report no explicit information about the experimental setting. For, respectively, 13 and seven of those experiments, it was explicitly stated that they were conducted in a laboratory and in a classroom. For another group of nine experiments, some sort of university setting was stated, for example, "academic settings," "conducted under exam conditions," and "supervised setting." However, one may assume that all the experiments with students were carried out in a laboratory or classroom. Moreover, we believe that the distinction between a classroom and a laboratory setting for students may be blurred and may depend on cultural differences, apart from the fact that a laboratory usually would include the use of PCs or workstations (2/3 of the experiments that report the use of a laboratory also report the use of a PC or workstation; see the next section).

Approximately half of all the experiments with students report the name of the actual university/college. For the 27 experiments that include professionals, the name of the company is reported in 12 cases. For four experiments, the company is not named, but the type of company is specified. Note that, depending on the actual experiment, certain companies have a policy such that they must remain anonymous in the reporting of the experiment. In five cases,
the professionals are described as coming from "several" companies or organizations. The exact number of companies is not given.6

6. Before we decided to rely exclusively on the information reported in the articles, we approached the corresponding authors of these five experiments to acquire more information about the companies involved in the experiments. It turned out that, in two experiments, the subjects attended a course aimed at people from industry (the number of companies of the participants was unknown). One author replied that it was a mistake in the article; all participants actually came from the same company. One replied that he did not know, but our impression was that it was only two companies. The last one did not respond to our request.

TABLE 13
Distribution of Experiments to Specified Tool

8.2 Tools

It is a challenge to configure the experimental environment with an infrastructure of supporting tools that resembles an industrial development environment. Among the surveyed experiments, 55 percent report on tools to support the tasks of the experiments (Table 13). This includes both explicit descriptions, e.g., "Sun-4, GNU C compiler," and implicit, but clear, indications, e.g., "Developed programs were run against a set of test data." Table 13 shows that the use of computer tools is slightly higher than the use of pen and paper. However, it is likely that a larger proportion of the experiments that do not report on tools are actually pen-and-paper experiments, because the added effort and administrative overhead of using computer tools might prompt researchers to report the use of tools more readily than the use of pen and paper.

The task types that give the largest and smallest contributions to the PC or workstation category are, respectively, Coding and Inspection. Other than that, there is little correlation between task type and tool for the experiments that actually report on this issue. Moreover, there was no difference between experiments with professionals and experiments with students regarding the use of tools.

Three of the five experiments with Combination in Table 13 explicitly test the effects of computerized tool use versus pen and paper.

The relatively meager proportion of experiments that report on the use of tools to support the assigned tasks may be due to an unawareness of, or a lack of interest in, the relevance of this issue. For example, most of the experiments in the Unknown category are inspection experiments, for which it may be normal to use pen and paper. However, for most design, coding, testing, and maintenance tasks, computer tools would have been used in an industrial setting, although the line is not clear-cut. For example, designers may sketch preliminary versions by hand, but the final design would be made using a tool.

In general, increasing the realism of software engineering experiments entails an increased use of industrial supporting tools. The community should thus recognize the effort and resources needed to set up PC or workstation environments with the right licenses, installations, access rights, etc., and to familiarize the subjects with the tools. Moreover, the tools must be checked to demonstrate acceptable performance and stability when many subjects are working simultaneously.

In the experiments of this survey, there is almost no discussion of the relationships among the three dimensions subject, task, and environment. For the community to progress, this issue needs to be investigated. For example, a professional development tool will probably become more useful the larger and more complex the tasks and application systems become, assuming that the subjects are sufficiently proficient with the tool.

9 REPLICATION

In this survey, 20 of the experiments are described by their authors as replications. These experiments constitute 14 series of replications. Table 14 summarizes the series, including both the original experiments and the replications, and reports the differences between them. Most replications are conducted in the areas of Inspection (seven replications, or 35 percent, in series 1, 2, and 3) and Maintenance (five replications in series 4, 5, and 6). Among the 20 replications, five can be considered close replications in the terminology of Lindsay and Ehrenberg [31], i.e., one attempts to retain, as much as is possible, most of the known conditions of the original experiment. The other replications are considered to be differentiated replications, i.e., they involve variations in essential aspects of the experimental conditions. One prominent variation involves conducting the experiment with other kinds of subjects: three replications use professionals instead of students, three use undergraduates instead of graduates, and one uses students instead of professionals. Other variations include conducting the experiment on different application systems (four) and with different tasks (three).

In all five close replications, the results of the original experiment were confirmed (three were conducted by the same authors, two by others). Among the 15 differentiated replications, seven were conducted by other authors. Six of these reported results differing from the original experiment and one partly confirmed the results of the original experiment. Among the differentiated replications conducted by the original authors, we found the opposite pattern; seven replications confirmed the results of the original experiment and only one reported partly different results.

"Methodological authorities generally regard replication, or what is also referred to as "repeating a study," to be a crucial aspect of the scientific method" [31]. However, only 18 percent of the surveyed experiments were replications. A discussion of the form and extent of replication that would benefit software engineering is beyond the scope of this paper, but it should be an issue of discussion for the research community.
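The outcome counts above can be tallied as a quick consistency check; the encoding below is our own paraphrase of the figures in the text, not the authors' data set.

    # Outcomes of the 20 replications described above.
    close_confirmed = 5  # all five close replications confirmed the original
    differentiated_by_others = {"differed": 6, "partly_confirmed": 1}
    differentiated_by_original = {"confirmed": 7, "partly_differed": 1}

    total = (close_confirmed
             + sum(differentiated_by_others.values())
             + sum(differentiated_by_original.values()))
    assert total == 20  # matches the 20 reported replications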
                                                            TABLE 14
                                                      Replicated Experiments




10 THREATS TO INTERNAL AND EXTERNAL VALIDITY

Two important aspects of the quality of an experiment are its internal and external validity. This section discusses how, and the extent to which, threats to internal and external validity are reported for the surveyed experiments. Descriptions of such threats are made in various ways and under different headings. For 54 experiments (48 percent of all experiments), there is a special section entitled "Threats to (internal/external) validity" or with some other combination that includes the terms "threats" or "validity." Nine other experiments (8 percent) have special sections on threats to validity but with other names (e.g., "Limitations to the results"). The reporting of threats to validity in yet another eight experiments was found in other sections.

10.1 Internal Validity

The internal validity of an experiment is "the validity of inferences about whether observed co-variation between A (the presumed treatment) and B (the presumed outcome) reflects a causal relationship from A to B as those variables were manipulated or measured" [37]. Changes in B may have causes other than, or alternative to, the manipulation of A. Such an alternative cause for the outcome is called a confound or confounding factor. For further discussions (including formal definitions) of concepts of confounding, see [21].

Threats to internal validity are addressed explicitly for 71 experiments (63 percent). (We did not include threats that are addressed implicitly as part of the experimental design.) We classified the reporting of threats to internal validity according to the scheme of Shadish et al. [37] shown in Table 15. That table also shows examples of threats in the various categories reported for the surveyed experiments. A version of this scheme, with other examples from software engineering, is presented in [46].

Table 16 shows the distribution of the experiments according to the scheme of Table 15. Almost half of all experiments report on selection threats (46 percent) and/or instrumentation threats (40 percent). The distribution of the number of threat categories reported is as follows: 22 experiments report one threat, 11 experiments report two threats, 23 report three, 10 report four, four report five, and one experiment reports seven threats.

Table 16 also shows whether the authors consider the threats to be present but not handled, or to be reduced or eliminated due to actions taken by the authors or due to other circumstances. Only 18 percent of the threats (reported in 23 percent of the experiments) are not handled, but it may be the case that threats that are not reduced or eliminated are underreported.
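The distribution just given can be checked with a small tally; the dictionary encoding is ours, taken from the counts above.

    # Number of experiments reporting a given number of internal-validity
    # threat categories (from the discussion of Table 16 above).
    threat_counts = {1: 22, 2: 11, 3: 23, 4: 10, 5: 4, 7: 1}

    # The per-count figures sum to the 71 experiments that address
    # threats to internal validity explicitly.
    assert sum(threat_counts.values()) == 71

    mentions = sum(k * v for k, v in threat_counts.items())
    print(mentions)  # 180 threat-category mentions in total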
                                                           TABLE 15
 Threats to Internal Validity: Reasons Why Inferences that the Relationship between Two Variables Is Causal May Be Incorrect




Classifying internal validity is not always straightforward. For example, "learning effects" are often classified as "maturation" in the experiments, while this should be "testing" according to the categories given in Shadish et al. [37]. Maturation threats refer to "natural changes that would occur even in the absence of treatment, such as growing older, hungrier, wiser, stronger, or more experienced," while testing threats refer to effects of practice and familiarity within the experiment that could be mistaken for treatment effects [37]. Moreover, threats that by this scheme pertain to statistical conclusion validity or construct validity were, for a few experiments, reported as internal validity threats. In part, this may be due to nontrivial subtleties in threat classification, illustrated by the fact that the line of development starting with Campbell et al. [9], via Cook et al. [11], to the present classification scheme in [37] shows considerable variation. For example, the notions of statistical conclusion validity and construct validity appeared for the first time in 1979 [11].7

7. In addition, there are threats that scholars put in different main categories. For example, what Trochim [44] and Wohlin et al. [46] refer to as "social threats" are categorized as threats to internal validity by them, but as threats to construct validity by Shadish et al. [37]. Four experiments address "social threats" in our survey, but since we follow the scheme of Shadish et al., such threats are not included in our survey.

TABLE 16
Threats to Internal Validity

10.2 External Validity

External validity concerns inferences about the extent to which a causal relationship holds over variations in persons, settings, treatments, and outcomes [37]. This section summarizes how the authors report threats to external validity regarding these issues.

Threats to external validity are reported for 78 experiments (69 percent). Table 17 shows a categorization of the
threats based on Shadish et al. [37]. Threats regarding subjects are discussed in a total of 67 experiments (rows one, four, five, six, and seven), regarding task in 60, environment in 23, and treatment in six.

TABLE 17
Threats to External Validity

Most threats regarding subjects deal with difficulties of generalizing from students to professionals (45 experiments). Another category of experiments (14) also uses students, but the experimenters argue that this may not be a threat to validity because the students for this kind of task would (probably) have the same ability as professionals (seven), because the students were close to finalizing their education and starting work in the industry (four), or because one other study showed no difference between students and professionals (three). A few experiments (three) that use professionals claim that threats to external validity were not critical because the experiment was conducted with professionals (they did not discuss the representation of their actual sample of professionals). A few experiments (three) considered that running an experiment within a single organization was a threat to the generalization to other organizations. Very few experiments (two) explicitly described the lack of random sampling as a threat to validity.

Most of the task-related threats concern the size and complexity of the tasks (16 experiments) and the experimental material (34), such as program code, inspection documents, and database systems. For experiments on inspection, one threat discussed was that the inspection process applied was not considered representative of industrial practice (nine). The (short) duration of the experiment was also regarded as a threat (three). One experiment stated that "all our results were obtained from one project, in one application domain, using one language and environment, within one software organization. Therefore, we cannot claim that our conclusions have general applicability, until our work has been replicated." Another experiment stated that the subjects might not have used the technique intended to be studied in the experiment.

Threats regarding environment were either stated as a problem of generalizing from the experimental setting with no specific reasons given (five experiments) or stated with concrete reasons for the difficulties: use of a laboratory or classroom (nine), individual work (five), and use of pen and paper (six).

A major finding is that the reporting is vague and unsystematic. The community needs guidelines that
provide significant support for how to draw conclusions from the experimental results and how to address threats to internal and external validity and their consequences.

11 THREATS TO THE VALIDITY OF THIS SURVEY

The main threats to the validity of this study are publication selection bias, inaccuracy in data extraction, and misclassification.

11.1 Selection of Journals and Conferences

We consider the 12 surveyed journals and conferences to be leaders in software engineering in general and in empirical software engineering in particular. (Our selection of journals is a superset of those selected by others, as shown in Table 1.) Nevertheless, a systematic survey that, in addition, included gray literature (theses, technical reports, working papers, etc.) describing controlled experiments in software engineering would, in principle, provide more data and allow more general conclusions to be drawn [29].

11.2 Selection of Articles

To help ensure an unbiased selection process, we defined research questions in advance, organized the selection of articles as a multistage process, involved several researchers in this process, and documented the reasons for inclusion/exclusion, as suggested in [29].

The initial investigation of the titles and abstracts of the 5,453 articles resulted in 140 candidate articles. Based on recorded comments, 80 of these were reanalyzed by one or two other researchers and discussed in the project group. Seventeen further articles were then excluded because they described studies without a treatment. Moreover, three articles were found to be exploratory, observational, or constituting a prestudy. Eight were found to fall outside the field of software engineering, five were excluded on the grounds that they were summary articles, while four articles described multiple case studies. (These exclusions account for the final count: 140 - 17 - 3 - 8 - 5 - 4 = 103 articles.) We used Inspec and various search engines to check the completeness of our inclusion and cross-checked for inclusion with other surveys [51], [22], [25]. Still, the process was difficult and we may not have managed to detect all the articles that we would have liked to include.

Another challenge was that there is no keyword standard that we are aware of that distinguishes between methods in empirical software engineering and that could be used to extract controlled experiments in a consistent manner. For example, none of the selected articles matched the IEEE keyword taxonomy; indeed, this taxonomy has no appropriate keywords for the methods of empirical software engineering. (MIS Quarterly has ceased to use its keyword classification scheme due to the presence of full-text search engines and the difficulty of keeping keyword classification schemes up to date [45].)

Moreover, article and experiment duplication is a potential threat to the frequency counts and statistics in this survey. Among the 113 experiments covered in the 103 articles, 109 are reported in one article, two are reported in two articles, one is reported in three articles, and one is reported in four. Among the 103 surveyed articles, 91 report a single experiment, seven report two experiments, and five report three experiments. We detected one case of near article duplicates in different journals. The structure of the database is designed to handle duplication, but a threat would be that duplication goes undetected. However, at least three people have read through all the relevant articles without detecting further duplicates.

11.3 Data Extraction

The data was extracted from the articles independently by two researchers. The interrater agreement varied from 73 percent to 100 percent. Disagreements were resolved by discussion and, when necessary, by involving other project members. Data extraction from prose is difficult at the outset, and the lack of standard terminology and of standards for reporting experiments in software engineering may have resulted in some inaccuracy in the data.

11.4 Classification to Topics

The classification of articles to topics was done in two steps. First, the articles were classified automatically on the basis of title, list of keywords, and registered treatment. Then, this classification was double-checked by two researchers. The interrater agreement between the algorithm and the two researchers was 75 percent for the comparative classification using Glass et al.'s scheme and 66 percent for the IEEE classification. The topic classification was difficult due to the lack of a well-defined method of classifying according to the schemes used.

12 SUMMARY

This paper reported a survey that quantitatively characterized the controlled experiments in software engineering published in nine journals and three conference proceedings in the decade from 1993 to 2002. Included were randomized experiments or quasi-experiments in which individuals or teams (the experimental units) applied a process, method, technique, language, or tool (the treatments) to conduct one or more software engineering tasks. Out of 5,453 articles scanned, we identified 103 articles that reported 113 controlled experiments.

Although as many as 108 institutions from 19 countries were involved in conducting the experiments, a relatively low proportion of software engineering articles (1.9 percent) report controlled experiments, given that the controlled experiment is the classical scientific method for identifying cause-effect relationships. One reason may be the large effort and resources needed to run well-designed experiments.

An important issue that pertains to all software engineering research is its relevance to the industry. For experiments, both the topics under investigation and how representative of an industrial setting an experiment is will influence industrial relevance. The two major areas investigated in the experiments were inspection techniques and object-oriented design techniques. This survey also gave some indications as to how realistic the experiments were relative to the subjects that took part, the tasks they performed, the types of applications on which these tasks were done, and the environment in which the subjects worked.
                                                       TABLE 18
                                           Total Number of Articles Investigated




In total, 5,488 subjects participated in the experiments. The number of participants ranged from 4 to 266, with a mean value of 49. In total, 87 percent of the subjects were students, whereas only 9 percent were professionals. This indicates that one may question how representative the experimental results are of an industrial setting.

The same applies to the kinds of application used in the experiments. In 75 percent of the experiments, the applications were constructed for the purpose of the experiment or constituted student projects. Commercial applications were used in 14 percent of the experiments.

Threats to internal and external validity were addressed in, respectively, 63 percent and 69 percent of the experiments. Among the threats to internal validity, about 1/5 were not handled, 3/5 were reduced, and 1/5 were eliminated. This could either mean that the experiments overall had a high degree of internal validity or that the internal threats that were not reduced or eliminated were underreported. Threats to external validity regarding subject and task were discussed in more than half of the experiments, regarding environment in about 1/4 of the experiments, and regarding treatment in only a few. Threats to internal validity regarding selection and instrumentation were the most frequently reported.

A major finding of this survey is that the reporting is often vague and unsystematic, and there is often a lack of consistent terminology. The community needs guidelines that provide significant support on how to deal with the methodological and practical complexity of conducting and reporting high-quality, preferably realistic, software engineering experiments. We recommend that researchers accurately report the following: the type and number of subjects, including the mortality rate; context variables such as general software engineering experience and experience specific to the tasks of the experiments; how the subjects were recruited; the application areas and types of tasks; the duration of the tasks; and the internal and external validity of the experiments, including being specific about the sample and target population of the experiment. A more uniform way of reporting experiments will help to improve the review of articles, the replication of experiments, meta-analysis, and theory building.
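One way to operationalize this checklist, purely as our illustration (the field names are invented, not a community standard), is to capture the recommended report items in a structured record:

    from dataclasses import dataclass, field

    @dataclass
    class ExperimentReport:
        """Items the survey recommends reporting for each experiment."""
        subject_type: str            # e.g., "undergraduates", "professionals"
        num_subjects: int
        mortality_rate: float        # proportion of subjects who dropped out
        general_experience: str      # context: general software engineering experience
        task_experience: str         # context: experience specific to the tasks
        recruitment: str             # how the subjects were recruited
        application_area: str
        task_types: list = field(default_factory=list)
        task_duration_hours: float = 0.0
        internal_validity: str = ""  # threats and how they were handled
        external_validity: str = ""  # sample versus target population

A record like this would make reviews, replications, and meta-analyses easier to compare across studies, which is the point of the recommendation.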


                                                         TABLE 19
                                   Number of Articles that Report Controlled Experiments
A survey of controlled experiments in software engineering
A survey of controlled experiments in software engineering

Contenu connexe

Tendances

Software Defect Prediction Using Radial Basis and Probabilistic Neural Networks
Software Defect Prediction Using Radial Basis and Probabilistic Neural NetworksSoftware Defect Prediction Using Radial Basis and Probabilistic Neural Networks
Software Defect Prediction Using Radial Basis and Probabilistic Neural NetworksEditor IJCATR
 
Digital Security by Design: Formal Verification with Broad-Spectrum ANSI-C Re...
Digital Security by Design: Formal Verification with Broad-Spectrum ANSI-C Re...Digital Security by Design: Formal Verification with Broad-Spectrum ANSI-C Re...
Digital Security by Design: Formal Verification with Broad-Spectrum ANSI-C Re...KTN
 
Maturing software engineering knowledge through classifications
Maturing software engineering knowledge through classificationsMaturing software engineering knowledge through classifications
Maturing software engineering knowledge through classificationsimmortalchhetri
 
IRJET - Neural Network based Leaf Disease Detection and Remedy Recommenda...
IRJET -  	  Neural Network based Leaf Disease Detection and Remedy Recommenda...IRJET -  	  Neural Network based Leaf Disease Detection and Remedy Recommenda...
IRJET - Neural Network based Leaf Disease Detection and Remedy Recommenda...IRJET Journal
 
STATE OF THE ART SURVEY ON DSPL SECURITY CHALLENGES
STATE OF THE ART SURVEY ON DSPL SECURITY CHALLENGESSTATE OF THE ART SURVEY ON DSPL SECURITY CHALLENGES
STATE OF THE ART SURVEY ON DSPL SECURITY CHALLENGESIJCSES Journal
 
Verification and validation of knowledge bases using test cases generated by ...
Verification and validation of knowledge bases using test cases generated by ...Verification and validation of knowledge bases using test cases generated by ...
Verification and validation of knowledge bases using test cases generated by ...Waqas Tariq
 
3 IGARSS2011_Pasolli_Final.pptx
3 IGARSS2011_Pasolli_Final.pptx3 IGARSS2011_Pasolli_Final.pptx
3 IGARSS2011_Pasolli_Final.pptxgrssieee
 
Kostogryzov 10.12.2009
Kostogryzov 10.12.2009Kostogryzov 10.12.2009
Kostogryzov 10.12.2009Mathmodels Net
 
LAS - System Biology Lesson
LAS - System Biology LessonLAS - System Biology Lesson
LAS - System Biology LessonLASircc
 
Comprehensive Testing Tool for Automatic Test Suite Generation, Prioritizatio...
Comprehensive Testing Tool for Automatic Test Suite Generation, Prioritizatio...Comprehensive Testing Tool for Automatic Test Suite Generation, Prioritizatio...
Comprehensive Testing Tool for Automatic Test Suite Generation, Prioritizatio...CSCJournals
 
ADDRESSING IMBALANCED CLASSES PROBLEM OF INTRUSION DETECTION SYSTEM USING WEI...
ADDRESSING IMBALANCED CLASSES PROBLEM OF INTRUSION DETECTION SYSTEM USING WEI...ADDRESSING IMBALANCED CLASSES PROBLEM OF INTRUSION DETECTION SYSTEM USING WEI...
ADDRESSING IMBALANCED CLASSES PROBLEM OF INTRUSION DETECTION SYSTEM USING WEI...IJCNCJournal
 
SOFTWARE TESTING: ISSUES AND CHALLENGES OF ARTIFICIAL INTELLIGENCE & MACHINE ...
SOFTWARE TESTING: ISSUES AND CHALLENGES OF ARTIFICIAL INTELLIGENCE & MACHINE ...SOFTWARE TESTING: ISSUES AND CHALLENGES OF ARTIFICIAL INTELLIGENCE & MACHINE ...
SOFTWARE TESTING: ISSUES AND CHALLENGES OF ARTIFICIAL INTELLIGENCE & MACHINE ...ijaia
 
Informatics Plenary Hayward Etal May2012
Informatics Plenary Hayward Etal May2012Informatics Plenary Hayward Etal May2012
Informatics Plenary Hayward Etal May2012Mark Hayward
 
A Study on MDE Approaches for Engineering Wireless Sensor Networks
A Study on MDE Approaches  for Engineering Wireless Sensor Networks A Study on MDE Approaches  for Engineering Wireless Sensor Networks
A Study on MDE Approaches for Engineering Wireless Sensor Networks Ivano Malavolta
 
2012 ieee projects software engineering @ Seabirds ( Trichy, Chennai, Pondich...
2012 ieee projects software engineering @ Seabirds ( Trichy, Chennai, Pondich...2012 ieee projects software engineering @ Seabirds ( Trichy, Chennai, Pondich...
2012 ieee projects software engineering @ Seabirds ( Trichy, Chennai, Pondich...SBGC
 

Tendances (19)

Software Defect Prediction Using Radial Basis and Probabilistic Neural Networks
Software Defect Prediction Using Radial Basis and Probabilistic Neural NetworksSoftware Defect Prediction Using Radial Basis and Probabilistic Neural Networks
Software Defect Prediction Using Radial Basis and Probabilistic Neural Networks
 
Digital Security by Design: Formal Verification with Broad-Spectrum ANSI-C Re...
Digital Security by Design: Formal Verification with Broad-Spectrum ANSI-C Re...Digital Security by Design: Formal Verification with Broad-Spectrum ANSI-C Re...
Digital Security by Design: Formal Verification with Broad-Spectrum ANSI-C Re...
 
Maturing software engineering knowledge through classifications
Maturing software engineering knowledge through classificationsMaturing software engineering knowledge through classifications
Maturing software engineering knowledge through classifications
 
Kost for china-2011
Kost for china-2011Kost for china-2011
Kost for china-2011
 
GiacomettiResume
GiacomettiResumeGiacomettiResume
GiacomettiResume
 
IRJET - Neural Network based Leaf Disease Detection and Remedy Recommenda...
IRJET -  	  Neural Network based Leaf Disease Detection and Remedy Recommenda...IRJET -  	  Neural Network based Leaf Disease Detection and Remedy Recommenda...
IRJET - Neural Network based Leaf Disease Detection and Remedy Recommenda...
 
STATE OF THE ART SURVEY ON DSPL SECURITY CHALLENGES
STATE OF THE ART SURVEY ON DSPL SECURITY CHALLENGESSTATE OF THE ART SURVEY ON DSPL SECURITY CHALLENGES
STATE OF THE ART SURVEY ON DSPL SECURITY CHALLENGES
 
Zander diss
Zander dissZander diss
Zander diss
 
Verification and validation of knowledge bases using test cases generated by ...
Verification and validation of knowledge bases using test cases generated by ...Verification and validation of knowledge bases using test cases generated by ...
Verification and validation of knowledge bases using test cases generated by ...
 
3 IGARSS2011_Pasolli_Final.pptx
3 IGARSS2011_Pasolli_Final.pptx3 IGARSS2011_Pasolli_Final.pptx
3 IGARSS2011_Pasolli_Final.pptx
 
Kostogryzov 10.12.2009
Kostogryzov 10.12.2009Kostogryzov 10.12.2009
Kostogryzov 10.12.2009
 
LAS - System Biology Lesson
LAS - System Biology LessonLAS - System Biology Lesson
LAS - System Biology Lesson
 
Comprehensive Testing Tool for Automatic Test Suite Generation, Prioritizatio...
Comprehensive Testing Tool for Automatic Test Suite Generation, Prioritizatio...Comprehensive Testing Tool for Automatic Test Suite Generation, Prioritizatio...
Comprehensive Testing Tool for Automatic Test Suite Generation, Prioritizatio...
 
ADDRESSING IMBALANCED CLASSES PROBLEM OF INTRUSION DETECTION SYSTEM USING WEI...
ADDRESSING IMBALANCED CLASSES PROBLEM OF INTRUSION DETECTION SYSTEM USING WEI...ADDRESSING IMBALANCED CLASSES PROBLEM OF INTRUSION DETECTION SYSTEM USING WEI...
ADDRESSING IMBALANCED CLASSES PROBLEM OF INTRUSION DETECTION SYSTEM USING WEI...
 
SOFTWARE TESTING: ISSUES AND CHALLENGES OF ARTIFICIAL INTELLIGENCE & MACHINE ...
SOFTWARE TESTING: ISSUES AND CHALLENGES OF ARTIFICIAL INTELLIGENCE & MACHINE ...SOFTWARE TESTING: ISSUES AND CHALLENGES OF ARTIFICIAL INTELLIGENCE & MACHINE ...
SOFTWARE TESTING: ISSUES AND CHALLENGES OF ARTIFICIAL INTELLIGENCE & MACHINE ...
 
Informatics Plenary Hayward Etal May2012
Informatics Plenary Hayward Etal May2012Informatics Plenary Hayward Etal May2012
Informatics Plenary Hayward Etal May2012
 
A Study on MDE Approaches for Engineering Wireless Sensor Networks
A Study on MDE Approaches  for Engineering Wireless Sensor Networks A Study on MDE Approaches  for Engineering Wireless Sensor Networks
A Study on MDE Approaches for Engineering Wireless Sensor Networks
 
2012 ieee projects software engineering @ Seabirds ( Trichy, Chennai, Pondich...
2012 ieee projects software engineering @ Seabirds ( Trichy, Chennai, Pondich...2012 ieee projects software engineering @ Seabirds ( Trichy, Chennai, Pondich...
2012 ieee projects software engineering @ Seabirds ( Trichy, Chennai, Pondich...
 
AS_industrialNew
AS_industrialNewAS_industrialNew
AS_industrialNew
 

En vedette

Software Engineering Large Practical coursework
Software Engineering Large Practical courseworkSoftware Engineering Large Practical coursework
Software Engineering Large Practical courseworkStephen Gilmore
 
Intelligence Agent - Artificial Intelligent (AI)
Intelligence Agent - Artificial Intelligent (AI)Intelligence Agent - Artificial Intelligent (AI)
Intelligence Agent - Artificial Intelligent (AI)mufassirin
 
software engineering notes for cse/it fifth semester
software engineering notes for cse/it fifth semestersoftware engineering notes for cse/it fifth semester
software engineering notes for cse/it fifth semesterrajesh199155
 
Lecture1 AI1 Introduction to artificial intelligence
Lecture1 AI1 Introduction to artificial intelligenceLecture1 AI1 Introduction to artificial intelligence
Lecture1 AI1 Introduction to artificial intelligenceAlbert Orriols-Puig
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligenceu053675
 
Software Engineering Lab Manual
Software Engineering Lab ManualSoftware Engineering Lab Manual
Software Engineering Lab ManualNeelamani Samal
 
Introduction To Software Engineering
Introduction To Software EngineeringIntroduction To Software Engineering
Introduction To Software EngineeringLeyla Bonilla
 

En vedette (7)

Software Engineering Large Practical coursework
Software Engineering Large Practical courseworkSoftware Engineering Large Practical coursework
Software Engineering Large Practical coursework
 
Intelligence Agent - Artificial Intelligent (AI)
Intelligence Agent - Artificial Intelligent (AI)Intelligence Agent - Artificial Intelligent (AI)
Intelligence Agent - Artificial Intelligent (AI)
 
software engineering notes for cse/it fifth semester
software engineering notes for cse/it fifth semestersoftware engineering notes for cse/it fifth semester
software engineering notes for cse/it fifth semester
 
Lecture1 AI1 Introduction to artificial intelligence
Lecture1 AI1 Introduction to artificial intelligenceLecture1 AI1 Introduction to artificial intelligence
Lecture1 AI1 Introduction to artificial intelligence
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
Software Engineering Lab Manual
Software Engineering Lab ManualSoftware Engineering Lab Manual
Software Engineering Lab Manual
 
Introduction To Software Engineering
Introduction To Software EngineeringIntroduction To Software Engineering
Introduction To Software Engineering
 

Similaire à A survey of controlled experiments in software engineering

Carol Harstad Research Proposal
Carol Harstad   Research ProposalCarol Harstad   Research Proposal
Carol Harstad Research ProposalCarol Harstad
 
Empirical research methods for software engineering
Empirical research methods for software engineeringEmpirical research methods for software engineering
Empirical research methods for software engineeringsarfraznawaz
 
Importance of Testing in SDLC
Importance of Testing in SDLCImportance of Testing in SDLC
Importance of Testing in SDLCIJEACS
 
White paper loren k schwappach
White paper   loren k schwappachWhite paper   loren k schwappach
White paper loren k schwappachLoren Schwappach
 
Software engineering ontology and software testing
Software engineering ontology and software testingSoftware engineering ontology and software testing
Software engineering ontology and software testingKhushbu Patel
 
Software Engineering Ontology and Software Testing
Software Engineering Ontology and Software Testing�Software Engineering Ontology and Software Testing�
Software Engineering Ontology and Software TestingKamal Patel
 
The Impact of Software Complexity on Cost and Quality - A Comparative Analysi...
The Impact of Software Complexity on Cost and Quality - A Comparative Analysi...The Impact of Software Complexity on Cost and Quality - A Comparative Analysi...
The Impact of Software Complexity on Cost and Quality - A Comparative Analysi...ijseajournal
 
ANALYZABILITY METRIC FOR MAINTAINABILITY OF OBJECT ORIENTED SOFTWARE SYSTEM
ANALYZABILITY METRIC FOR MAINTAINABILITY OF OBJECT ORIENTED SOFTWARE SYSTEMANALYZABILITY METRIC FOR MAINTAINABILITY OF OBJECT ORIENTED SOFTWARE SYSTEM
ANALYZABILITY METRIC FOR MAINTAINABILITY OF OBJECT ORIENTED SOFTWARE SYSTEMIAEME Publication
 
Case Study Based Software Engineering Project Development: State of Art
Case Study Based Software Engineering Project Development: State of ArtCase Study Based Software Engineering Project Development: State of Art
Case Study Based Software Engineering Project Development: State of ArtDr Sukhpal Singh Gill
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 
A REVIEW OF SECURITY INTEGRATION TECHNIQUE IN AGILE SOFTWARE DEVELOPMENT
A REVIEW OF SECURITY INTEGRATION TECHNIQUE IN AGILE SOFTWARE DEVELOPMENTA REVIEW OF SECURITY INTEGRATION TECHNIQUE IN AGILE SOFTWARE DEVELOPMENT
A REVIEW OF SECURITY INTEGRATION TECHNIQUE IN AGILE SOFTWARE DEVELOPMENTijseajournal
 
Introduction to Software Engineering
Introduction to Software EngineeringIntroduction to Software Engineering
Introduction to Software EngineeringRizwan411667
 
Software Systems Requirements Engineering
Software Systems Requirements EngineeringSoftware Systems Requirements Engineering
Software Systems Requirements EngineeringKristen Wilson
 
Research issues in object oriented software testing
Research issues in object oriented software testingResearch issues in object oriented software testing
Research issues in object oriented software testingAnshul Vinayak
 

Similaire à A survey of controlled experiments in software engineering (20)

Thesis Proposal Presentation
Thesis Proposal PresentationThesis Proposal Presentation
Thesis Proposal Presentation
 
Technical report jada
Technical report jadaTechnical report jada
Technical report jada
 
Ijetcas14 438
Ijetcas14 438Ijetcas14 438
Ijetcas14 438
 
Carol Harstad Research Proposal
Carol Harstad   Research ProposalCarol Harstad   Research Proposal
Carol Harstad Research Proposal
 
Empirical research methods for software engineering
Empirical research methods for software engineeringEmpirical research methods for software engineering
Empirical research methods for software engineering
 
Importance of Testing in SDLC
Importance of Testing in SDLCImportance of Testing in SDLC
Importance of Testing in SDLC
 
White paper loren k schwappach
White paper   loren k schwappachWhite paper   loren k schwappach
White paper loren k schwappach
 
Software engineering ontology and software testing
Software engineering ontology and software testingSoftware engineering ontology and software testing
Software engineering ontology and software testing
 
Software Engineering Ontology and Software Testing
Software Engineering Ontology and Software Testing�Software Engineering Ontology and Software Testing�
Software Engineering Ontology and Software Testing
 
The Impact of Software Complexity on Cost and Quality - A Comparative Analysi...
The Impact of Software Complexity on Cost and Quality - A Comparative Analysi...The Impact of Software Complexity on Cost and Quality - A Comparative Analysi...
The Impact of Software Complexity on Cost and Quality - A Comparative Analysi...
 
ANALYZABILITY METRIC FOR MAINTAINABILITY OF OBJECT ORIENTED SOFTWARE SYSTEM
ANALYZABILITY METRIC FOR MAINTAINABILITY OF OBJECT ORIENTED SOFTWARE SYSTEMANALYZABILITY METRIC FOR MAINTAINABILITY OF OBJECT ORIENTED SOFTWARE SYSTEM
ANALYZABILITY METRIC FOR MAINTAINABILITY OF OBJECT ORIENTED SOFTWARE SYSTEM
 
Case Study Based Software Engineering Project Development: State of Art
Case Study Based Software Engineering Project Development: State of ArtCase Study Based Software Engineering Project Development: State of Art
Case Study Based Software Engineering Project Development: State of Art
 
Paper review
Paper reviewPaper review
Paper review
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
H1803044651
H1803044651H1803044651
H1803044651
 
A REVIEW OF SECURITY INTEGRATION TECHNIQUE IN AGILE SOFTWARE DEVELOPMENT
(process, method, technique, language, or tool) is useful for whom to conduct which (software engineering) tasks in which environments. Hence, this survey focuses on the kind of technology being studied in the experiments investigated (which reflects the topics of the experiments), the subjects that took part, the tasks they performed, the type of application systems on which these tasks were performed, and the environments in which the experiments were conducted. This survey also includes data on experiment replication and the extent to which internal and external validity is discussed.

The journals are ACM Transactions on Software Engineering and Methodology (TOSEM), Empirical Software Engineering (EMSE), IEEE Computer, IEEE Software, IEEE Transactions on Software Engineering (TSE), Information and Software Technology (IST), Journal of Systems and Software (JSS), Software Maintenance and Evolution (SME), and Software: Practice and Experience (SPE). The conferences are the International Conference on Software Engineering (ICSE), the IEEE International Symposium on Empirical Software Engineering (ISESE), and the IEEE International Symposium on Software Metrics (METRICS).

The paper is organized as follows: Section 2 describes related work. Section 3 defines the research method for the survey. Section 4 reports the extent of controlled experiments, and Sections 5-10 report our main findings. Section 11 discusses threats to the validity of this survey. Section 12 summarizes.

2 RELATED WORK

Table 1 summarizes the purpose, scope, and extent of four major surveys in addition to this survey. Tichy et al. [43] compare the amount of experimental work published in a few computer science journals and conference proceedings with the amount of experimental work
published in a journal on artificial neural networks and a journal on optical engineering. In total, 403 articles were surveyed and classified into five categories: formal theory, design and modeling, empirical work, hypothesis testing, and other. Zelkowitz and Wallace [49] propose a taxonomy of empirical studies in software engineering and report a survey in which 612 articles were classified within this taxonomy. Glass et al. [20] investigate 369 articles with respect to topic, research approach, research method, reference discipline, and level of analysis.

TABLE 1. Surveys of Empirical Studies in Software Engineering

The above surveys give a comprehensive picture of research methods used in software engineering. They differ in purpose, selection criteria, and taxonomies. Nevertheless, their results suggest the same conclusions: the majority of published articles in computer science and software engineering provide little or no empirical validation, and the proportion of controlled experiments is particularly low. The surveys propose means to increase the amount of empirical studies and their quality.

The major difference between those surveys and ours is that they describe the extent and characteristics of various types of empirical study, while we provide an in-depth study of controlled experiments only. A comparison of those surveys and ours regarding the extent of controlled experiments is provided in Section 4.

In addition to the general surveys described above, there are several surveys within subdisciplines of software engineering, for example, object-oriented technology [14], testing techniques [28], and software effort estimation [25]. Furthermore, Shaw [38] categorizes the research reported in articles submitted and accepted for ICSE 2002, and Zendler [51] reports a survey of 31 experiments with the aim of developing a preliminary theory about software engineering.

3 RESEARCH METHOD

This section describes the kind of experiments that are considered in this survey and the procedure for identifying and analyzing the relevant articles.

3.1 Controlled Experiments in Software Engineering

Shadish et al. [37] provide the following definitions:

- Experiment: A study in which an intervention is deliberately introduced to observe its effects.
- Randomized experiment: An experiment in which units are assigned to receive the treatment or an alternative condition by a random process such as the toss of a coin or a table of random numbers.
- Quasi-experiment: An experiment in which units are not assigned to conditions randomly.
- Correlation study: Usually synonymous with nonexperimental or observational study; a study that simply observes the size and direction of a relationship among variables.

To identify the effects of the deliberate intervention in an experiment, factors that may influence the outcome, in addition to the treatment, should also be controlled. (Some definitions are very explicit on the aspect of control; for example, Zimney [52] defines a psychological experiment as an "objective observation of phenomena which are made to occur in a strictly controlled situation in which one or more factors are varied and the others are kept constant"; see the discussion of this definition in [10].) This is the challenge of internal validity (see Section 10.1). Note that control is not an all-or-nothing condition; the degree of control varies on a continuous scale.

Based on the definitions given above, we present an operational definition used for this survey. Since the term
"experiment" is inconsistently used in the software engineering community (often used synonymously with empirical study), we use the term "controlled experiment":

Controlled experiment in software engineering (operational definition): A randomized experiment or a quasi-experiment in which individuals or teams (the experimental units) conduct one or more software engineering tasks for the sake of comparing different populations, processes, methods, techniques, languages, or tools (the treatments).

We do not distinguish between randomized experiments and quasi-experiments in this survey because both experimental designs are relevant to empirical software engineering experimentation. Random assignment of experimental units to treatments may not always be feasible, e.g., for logistic reasons. For example, one of the surveyed experiments used units formed from existing training groups in a company; random assignment would, in this case, have disturbed the training process.

We exclude several types of study that share certain characteristics with controlled experiments because, while these may be highly relevant for the field, they do not apply the deliberate intervention essential to controlled experiments. Thus, we exclude correlation studies, studies that are solely based on calculations on existing data (e.g., from data mining), and evaluations of simulated teams based on data for individuals. The last category falls outside our operational definition because the units are constructed after the run of the experiment.

Studies that use projects or companies as treatment groups, in which data is collected at several levels (treatment defined, but no experimental unit defined), are also excluded because we consider these to be multiple case studies [47] (even though the authors might refer to them as experiments). Our survey focuses on articles that provide the main reporting of experiments. This excludes articles that, at the outset, would not provide sufficient data for our analyses (e.g., summaries of research programs). Moreover, usability experiments are not included, since we regard those as part of another discipline (human-computer interaction).

3.2 Identification of Articles that Report Controlled Experiments

In order to identify and extract controlled experiments, one researcher systematically read the titles and abstracts of 5,453 scientific articles published in the selected journals and conference proceedings for the period 1993-2002. Excluded from the search were editorials, prefaces, article summaries, interviews, news, reviews, correspondence, discussions, comments, reader's letters, and summaries of tutorials, workshops, panels, and poster sessions.

If it was unclear from the title or abstract whether a controlled experiment was described, the entire article was read by both the same researcher and another person in the project team. In the end, 103 articles were selected. Note that identifying the relevant articles is not straightforward, because the terminology in this area is confusing. For example, several authors claim that they describe experiments even though no treatment is applied in the study.

3.3 Analysis of the Articles

The survey data is stored in a relational database (MS SQL Server 2000). Some data is specific to an article, some is specific to an experiment, and some information concerns the combination of article and experiment. Moreover, an article might describe several experiments, and an experiment might be described in several articles, typically with a different focus in each article. Consequently, we defined a data model with the entities article, experiment, and article-experiment, with a set of attributes relevant to our survey. In addition to the survey database, a catalog of all the articles in searchable pdf format was generated. (About 3/4 of the articles were provided in searchable pdf format by the journal publishers; the remaining 1/4 were OCR-scanned. The survey database and the catalog of articles may be provided upon request to the corresponding author, under the conditions of a signed agreement toward the use of the data.)
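The three entities form a classic many-to-many design. A minimal sketch in Python/SQLite of how such a schema might look (the survey used MS SQL Server and its schema is not published, so every attribute name below is a hypothetical choice for illustration):

import sqlite3

# Illustrative only: the entity names come from the paper; the attribute
# names are assumptions made for the sake of the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (
    article_id INTEGER PRIMARY KEY,
    title      TEXT,
    venue      TEXT,     -- journal or conference
    year       INTEGER
);
CREATE TABLE experiment (
    experiment_id INTEGER PRIMARY KEY,
    topic         TEXT,
    num_subjects  INTEGER
);
-- An article may report several experiments and an experiment may be
-- reported in several articles, hence the association table.
CREATE TABLE article_experiment (
    article_id    INTEGER REFERENCES article(article_id),
    experiment_id INTEGER REFERENCES experiment(experiment_id),
    focus         TEXT,  -- what this article emphasizes about the experiment
    PRIMARY KEY (article_id, experiment_id)
);
""")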
TABLE 2. Aspects and Their Attributes for Data Extraction

The articles were analyzed according to the six aspects listed in Table 2. Each aspect encompasses a set of attributes for data extraction.
TABLE 3. Articles that Report Controlled Experiments

Six researchers analyzed the articles so that each aspect above was covered by at least two persons. After the initial analysis, the results were compared and possible conflicts resolved by reviewing the articles collectively a third time or handing over the article to a third person. The main analysis tool was SAS.

4 EXTENT

Controlled experiments, as defined in Section 3.1, are reported in 103 (1.9 percent) of the 5,453 articles scanned for this survey; see Table 3. The 103 articles report a total of 113 controlled experiments. Twelve articles report more than one experiment, and four experiments are reported in several articles. EMSE, ISESE, and METRICS, which focus specifically on empirical software engineering, report a higher proportion of controlled experiments than the other journals and the ICSE conference. The mean percentage of controlled experiments across years varies between 0.6 and 3.5, but we see no marked trend over the years. An overview of the trend for the individual journals/conferences is presented in the Appendix.

The surveys summarized in Table 1 also report extent. Tichy et al. have the study type definition with the broadest scope and report that 14 percent of the articles published in the specific software engineering journals TSE and TOPLAS (Transactions on Programming Languages and Systems) describe empirical work. In Glass et al., the authors classify 3 percent of the articles as laboratory experiments using human subjects and 1 percent as field experiments. According to the survey by Zelkowitz and Wallace, experiments defined as controlled methods are reported in 2.9 percent of the articles. Our survey finds the lowest percentage of articles (1.9 percent) that report controlled experiments. This might be because our study type definition is narrower than those of the other studies or because our investigation spans more sources and years.

We rank institutions and scholars according to the number of experiments published (not the quality), but relative to their fractional representation on the article(s) that report the experiment (see the sketch at the end of this section). Glass and Chen [18] also ranked institutions and scholars, but according to publication in systems and software engineering, and they used a more complex ranking scheme for scholars.

In total, 207 scholars are involved in the experiments of our survey. Table 4 presents the top 20 ranked scholars. Due to the fractional distribution, the number of experiments in which a scholar has actually been involved is typically higher than the scores in Table 4. For instance, the top-ranked scholar, Giuseppe Visaggio, was involved in six experiments described in four papers authored by one to three scholars, resulting in a fractional score of 4.2 experiments. Among the 20 top-ranked scholars, three (Laitenberger, Roper, Wood) were involved in eight experiments, one was involved in seven, four in six, two in five, nine in four, and one was involved in three experiments.

There are 109 institutions from 19 countries involved in the experiments of our survey. The scores for institutions are accumulated from the scores of affiliated authors. Table 5 presents the top 10 ranked institutions.

The institution that has used the most professionals as subjects throughout the surveyed time period is the Fraunhofer Institute, Kaiserslautern. In total, they used 100 professionals in six experiments, ranging from 11 to 20 in a single experiment. The institution that conducted the experiment involving the largest number (68) of professionals as subjects was Andersen Consulting (now Accenture), Norway.
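The fractional scoring can be made concrete. One plausible reading (the paper does not spell out the formula) is that each experiment contributes one point, split evenly among the authors of the reporting article; the coauthor pattern in the sketch below is hypothetical, chosen only to show how a score of about 4.2 from six experiments can arise.

from collections import defaultdict

def fractional_scores(experiment_author_lists):
    # One point per experiment, split evenly among the authors of the
    # reporting article. This is our reading of "fractional representation";
    # experiments reported in several articles would additionally need a
    # convention for merging author lists.
    scores = defaultdict(float)
    for authors in experiment_author_lists:
        for author in authors:
            scores[author] += 1.0 / len(authors)
    return dict(scores)

demo = fractional_scores([
    ["X"], ["X"], ["X"],               # three single-author experiments
    ["X", "A"],                        # one two-author experiment
    ["X", "A", "B"], ["X", "A", "B"],  # two three-author experiments
])
print(round(demo["X"], 1))  # 3*1 + 1/2 + 2*(1/3) = 4.2 (rounded)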
5 TOPICS

This section describes two classifications of the 103 analyzed articles according to their main topic. The first classification illustrates the experiments' discipline coverage relative to
software engineering as a whole, while the second classification has a more technical focus on software engineering method and methodology. The analysis is with respect to article, rather than experiment; this is adequate since no two experiments on different topics are reported in the same article. Both classifications emphasize the treatment of an experiment, since treatment, being the intervention of interest (Section 3), indicates the de facto topic under investigation.

TABLE 4. Top 20 Scholars Conducting Controlled Experiments in Software Engineering 1993-2002

5.1 Classification Scheme: Glass et al.

There are a number of classification schemes for computing science and software engineering, e.g., SWEBOK [1] and Glass et al. [20]. The classification scheme of Glass et al. is aimed at positioning software engineering research relative to a backdrop of overall computing disciplines, i.e., computer science, software engineering, and information systems, and their classification categories are meant to give uniformity across all three fields [19]. The scheme is, therefore, somewhat general. On the other hand, this scheme has actually been used in classifying work undertaken in software engineering and can therefore be used for illustrating the relative topic coverage of controlled experiments in software engineering.

Fig. 1 shows the distribution to topic categories of controlled experiments in software engineering relative to software engineering research in general. Controlled experiments seem to cover at least the categories that are well represented by general SE research, but remember that the overall number of controlled experiments performed is low (Section 2). Recall that experiments on topics purely within human-computer interaction are not included in this survey, as is the case for topics purely within information systems. Our focus on experiments with human subjects also excludes a range of software engineering topics.

TABLE 5. Top 10 Institutions Conducting Controlled Experiments in Software Engineering 1993-2002
Fig. 1. Comparative distribution to topic of software engineering research and software engineering controlled experiments using the scheme of Glass et al. Only nonvoid categories are shown.

The two most prominent categories are Software lifecycle/engineering (49 percent) and Methods/Techniques (32 percent), due to, respectively, the relatively large number of experiments on inspection techniques and object-oriented design techniques.

5.2 Classification Scheme: IEEE Keyword Taxonomy

The IEEE Keyword Taxonomy [24] provides a more technical perspective than the scheme of Glass et al. [20]. This taxonomy is recommended for authors of IEEE articles and is an extended version of the ACM Computing Classification [2].

We use the IEEE keywords to denote topic categories. The classification according to the IEEE Keyword Taxonomy is given in Table 6. The two prominent technical areas are Code inspections and walkthroughs (35 percent) and Object-oriented design methods (8 percent). The numbers of experiments are limited for other areas.

6 SUBJECTS

This section describes the kind of subjects that take part in the experiments, the kind of information that is reported about them, and how they are recruited.

6.1 Number and Categories of Subjects in the Experiments

In total, 5,488 subjects took part in the 113 experiments investigated in this survey. Eighty-seven percent were students and nine percent were professionals. The reported subject types are divided into the categories given in Table 7.

The number of participants per experiment ranges from four to 266, with a mean value of 48.6 (Table 8). Students participated in 91 (81 percent) of the experiments, either alone or together with professionals and/or scientists, and professionals took part in 27 experiments (24 percent). The use of professionals as subjects has been relatively stable over time. Undergraduates are used much more often than graduate students. For one experiment, no information about subject type was provided; for eight experiments, no details about the type of students were given; and for five experiments with mixed types of subject, no information about the number in each category was provided.

The issue of differences between students and professionals has been discussed in the literature [13], [12], [48], [36]. Interestingly, while seven articles describe experiments using both students and professionals, only three of them measure the difference in performance between the two groups. In the first experiment, categorized as Software psychology, three programming tasks were performed. For two of the tasks, there was no difference between the groups, whereas, for the third task, the professionals were significantly better. In the second experiment, also in Software psychology, there was no difference. In the third experiment, categorized as Maintenance process, the professionals were significantly better.

The performance in an experiment may differ between subcategories of subjects; that is, there may be an interaction effect between treatment and subcategory [3]. However, none of the surveyed experiments distinguished between subcategories of professionals or students.

The experiment with the highest number of professionals (68) was classified as Cost estimation in the IEEE taxonomy
TABLE 6. Classification of Articles According to IEEE Taxonomy

(Table 6). Then, there were five experiments with 29-35 professionals, of which one also employed 20 students. These five were categorized, in descending order regarding number of subjects, as Modules and interfaces, Code inspections and walkthroughs, Maintenance process, Software psychology (understanding code), and Patterns.

The total number of participants was reported in all the articles, either explicitly or implicitly; in the latter case, we could roughly calculate the number (for instance, from the information that 10 teams averaging four subjects participated). Subject mortality (drop-outs) was reported in 24 experiments (2 percent mortality on average). Even in experiments with as many as 266 subjects (as well as many other experiments with a relatively high number of subjects), no mortality was reported. One article states that "Non-random drop-out of subjects has been avoided by the experimental design, i.e. assignment of groups only on the second day of the experiment, i.e., directly before the treatment, and not before the pre-test already on the first day of the experiment." However, most articles say nothing about how mortality was managed.

There are good reasons for conducting experiments with students as subjects, for example, for testing experimental design and initial hypotheses, or for educational purposes [42]. Depending on the actual experiment, students may also be representative of junior/inexperienced professionals. However, the low proportion of professionals used in software engineering experiments reduces experimental realism, which in turn may inhibit the understanding of industrial software processes and, consequently, technology
TABLE 7. Subject Categories

transfer from the research community to industry. Hence, to break the trend of few professionals as subjects, new strategies are needed to overcome these challenges; see, e.g., the discussions in [39], [40].

6.2 Information about Subjects

In order to generalize from an experiment with a given group of subjects (sample population), one needs information about various characteristics and their variation, both in the sample and in the group to which the results will be generalized (target population) [7]. For professionals, depending on what we wish to study, it would be relevant to know the variations regarding competence, productivity, education, experience (including domains), age, culture, etc. However, there is no generally accepted set of background variables for guiding data collection in a given type of study, simply because the software engineering community does not know which variables are the important ones. We have chosen to focus on the variables that are reported in the analyzed articles, that is, gender, age, education, experience, and task-related training.

The analyzed articles vary to a large extent in how they report such information. For 14 of the 113 experiments, no information about the subjects was reported. Moreover, the level of detail reported varies substantially. An example of detailed information on programming experience is: "On average, subjects' previous programming experience was 7.5 years, using 4.6 different programming languages with a largest program of 3510 LOC. Before the course, 69 percent of the subjects had some previous experience with object-oriented programming, 58 percent with programming GUIs." An example of a high-level description without figures is: "Some of the students had industrial programming experience."

For the 91 experiments with students, the following information was reported: gender (seven experiments), age (six experiments), grades (six experiments), programming experience (general description: 17 experiments; number of years/languages: 11 experiments), work experience in industry (general description: nine experiments; number of years: nine experiments), task-related experience (64 experiments), and task-related training (27 experiments). The training was either tailored specifically for the experiment or was part of a course, or the experiment could be conducted as part of a training session.

TABLE 8. Participants in Experiments

For the 27 experiments with professionals, more details on the subjects' background were given. Categories of professionals, such as reviewers, analysts, programmers, and managers, were given for seven experiments. Subjects'
TABLE 9. Subject Reward Data

degrees were described for three experiments. Gender and age were given for, respectively, two and three experiments. Language and nationality were given for one experiment (subjects from two countries participated). A general description of programming experience was given for two experiments. Programming experience in years/languages was given for seven experiments. A self-assessment of programming experience was reported for two experiments. Work experience in years was given for five experiments. A general description of task-related experience was reported in one experiment. Task-related experience was measured in years for 13 experiments. Task-related training was reported for 12 experiments.

The relatively low and arbitrary reporting on context variables is a hindrance for metastudies, which are needed to identify which context factors influence which kinds of performance. The impact of the various context factors will, of course, depend on the treatments and actual tasks to be performed in the experiments. Future work should investigate the extent to which the variation in performance of subjects can be explained by their background, such as education and work experience, and increase our knowledge of the impact of using students versus professionals as subjects in software engineering experiments.

6.3 Recruitment of Subjects

Recruiting subjects to experiments is not a trivial task, from either a methodological or a practical point of view. For example, volunteers may bias the results because they are often more motivated, skilled, etc., than subjects who take part because it is mandatory in some way [8]. Information about whether participation was mandatory is reported for 41 (36 percent) of the experiments. For 12 of them (all student experiments), participation was mandatory. Information about subject compensation for taking part in experiments is reported for 39 (35 percent) of the experiments. The grades of students were affected by the participation in 10 cases, and they received extra credits in nine cases (Table 9). In three cases, students were paid to take part and, in one case, students were sponsored for a trip to an exhibition. No compensation to professionals is reported. Typically, the experiments with professionals were organized as part of normal projects or training programs, and payment was thus implicitly provided by the employer. Hence, it seems that none of the researchers or research teams paid companies or professionals for taking part in experiments.

If one applies statistical hypothesis testing, a requirement is to have a well-defined population from which the sample is drawn: "If you cannot define the population from which your subjects/objects are drawn, it is not possible to draw any inference from the results of your experiment" [30]. (This claim should, as we understand it, not be interpreted outside the context of statistical hypothesis testing. Obviously, even a study without a well-defined population, but with a well-defined sample, may enable the researcher to infer about similar projects, e.g., based on argumentation by analogy or by theory; see the further discussion in [26].) Nevertheless, only one of the experiments in this survey that apply statistical hypothesis testing actually reported sampling from a well-defined target population.

For only a couple of experiments, random sampling of subjects was claimed. How the random sampling was carried out was not described. The dominant approach was convenience sampling: "Subjects are selected because of their convenient accessibility to the researcher. These subjects are chosen simply because they are the easiest to obtain for the study. This technique is easy, fast and usually the least expensive and troublesome. ... The criticism of this technique is that bias is introduced into the sample." [34]. This does not mean that convenience sampling is generally inappropriate. For example, Ferber [17] refers to the exploratory, the illustrative, and the clinical situations in which convenience sampling may be appropriate. In software engineering, the most convenient way of recruiting subjects is to use the students that are taught by the researcher. (Note that convenience sampling is also common in other disciplines, such as clinical medicine [34] and the social sciences [33].)

To increase the potential for sampling subjects from a well-defined population and to alleviate the problem of having few professionals as subjects (Section 6.1), the experimental software engineering community should apply new incentives, for example, paying companies directly for the hours spent on an experiment [3] or offering the companies tailored, internal courses whose exercises can be used in experiments [27]. Payment would require that researchers include expenses for this kind of experiment in their applications to funding bodies; see the further discussion in [39].

7 TASKS

The tasks that subjects are asked to carry out are an important characteristic of a software engineering experiment. Tasks may include building a software application from scratch or performing various operations on an existing application. This section reports on the surveyed experiments according to a high-level categorization of their tasks and the duration of those tasks. Moreover, we describe the total magnitude of the experiments by reporting the product of the number of subjects and the duration of the tasks. Finally, we describe the kinds and size of application and materials used in the experiments.

7.1 Task Categorization

We categorize tasks given to subjects according to the main tasks in a software process. We have defined four general categories, Plan, Create, Modify, and Analyze, that reflect major tasks on software artefacts. Table 10 shows subcategories within these major categories that have been identified in the surveyed experiments.
TABLE 10. Task Categorization, Duration, and Materials Size

Task categorization is somewhat different from topic categorization. Tasks of a certain category can be used to test hypotheses within various topics. For example, a maintenance task can be used to test a certain design, or an experiment assigned to the Patterns category in the IEEE taxonomy might have design, coding, or maintenance tasks.

Table 10 shows the number of experiments deploying each kind of task. Note that tasks of several categories might be involved in a single experiment. A task is represented by its fraction of all tasks in an experiment; for example, an experiment with one Design task and one Coding task gives a contribution of 0.5 to each of the two task categories. (Note also that we do not distinguish between tasks and subtasks, because there is no commonly agreed definition of the unit of task. Hence, in the relatively few cases in which the experimenters have divided their tasks into subtasks, we have considered them as one task as long as they fall within the same category.) Due to experiment design, a task may be performed several times by the same subjects but with different treatments. In such cases, however, the task is only counted once.

The proportion of planning, creation, modification, and analysis tasks is, respectively, 10 percent, 20 percent, 16 percent, and 54 percent. Inspection tasks occur in 37 (33 percent) of the experiments and are by far the most prominent. This is in accordance with the topic classification of articles reported in Section 5. Thirty-six of these experiments involve individual inspections, and 29 involve team inspections. Twenty-eight experiments involve both individual and team inspections. Inspection tasks are typically conducted using pen and paper, although some use support tools.

Document comprehension tasks form the basis of various software engineering tasks and typically involve answering questions about system structure and functionality. Twenty-three experiments involve document comprehension tasks; 12 of these pertain to code documents, nine are design comprehension, one concerns a requirements document, and one concerns process components.

Maintenance tasks pertain to all document types. The surveyed experiments, however, only deal with design and code maintenance tasks; 19 experiments give code maintenance tasks and three give change tasks on design documents (including impact analyses). None give both. In 10 of the code maintenance tasks, new functionality was added. One of the code maintenance tasks is a pen-and-paper maintenance task performed jointly with a comprehension task.

Coding and Design are tasks in which new artefacts are produced. Modifying existing code or design documents is classified as maintenance.

Most of the Testing tasks involve the generation of test harnesses and test cases. Testing here also includes debugging using debugging tools, but excludes inspections.
7.2 Task Duration

An important task characteristic is duration. Accurate duration data per subject (typically in dependent variables) is reported in 41 (36 percent) of the experiments, and at slot level in 28 (25 percent) of the experiments. (Time slots are coarse-grained indications, typically upper bounds, of how much time the subjects took to perform a task. For example, we chose a slot of two hours from the information that "We gave each subject up to three hours to review each document (i.e., one document in the morning, and one in the afternoon). Only one subject took more than two hours.") Duration data that is not considered sufficient for analysis is contained in phrases like "Six days, no time limit," "From 1 to 11 days depending on the subjects' skills," and "Non-programming subjects had 30 min. to finish their task. Programming subjects had one week."

Fig. 2. Distribution of experiments with subject-level duration data to time intervals.

Fig. 2 shows the frequency by time interval of the 41 experiments with detailed subject-level time information. It appears that about 2/3 of those experiments last less than two hours.

The two leftmost "Subject level" columns of Table 10 show, for each task category, respectively, the fraction of tasks and the number of experiments with subject-level duration data that include tasks of this category. The third "Subject level" column shows the median duration in hours for these tasks. For example, three experiments have duration data for design tasks. The fraction of the time spent on design activities in these three experiments, relative to the total time for all experiments (with subject-level time data), is 2.3. The median time used on design tasks in these experiments is 0.85 hours. The median duration of all experiments with subject-level data is 1.0 hours, and 2.0 hours for the experiments with slot-level data.

Three experiments investigate the effects of preplanning estimates on detailed estimates (anchoring). In one of these, the Estimation task is part of a Project planning exercise. One experiment involves estimation in a larger project, although project planning as such is not a task in the experiment in question. Two experiments issue estimation tasks in order to compare various estimation techniques.

Four of the five experiments with Project planning tasks are all-student experiments in which the subjects were asked to role-play in project planning or to simulate projects. The fifth one involves both professionals and students assessing how 10 different factors affect the lead-time of software development projects.

In the experiment involving Requirements analysis, the subjects were asked to negotiate software requirements in a meeting with a customer.

Forty experiments deploy tasks in several categories. Among these experiments, five involve three tasks and two involve four tasks: one has comprehension, maintenance, and inspection (individual and team) tasks, and one has design, coding, team inspection, and testing.

Fig. 3. Task categories and subject-level duration data.

Fig. 3 shows the actual task duration for the subject-level occurrences of Table 10. In the interest of saving space, four data points at, respectively, 25 (Change code), 18.5 (Coding), 18.5 (Design), and 55 hours (Coding) are omitted from the
figure. It appears that there is large variance in duration and that it seems independent of the type of task being performed.

TABLE 11. Distribution of Experiments to Subject Number, Duration, and Subject-Duration Categories

Little is mentioned in the articles about control over context variables in experiments with multiple sessions, with idle periods, or that span several days or weeks. Although the issue in principle concerns all experiments, it would be particularly interesting to know how experimenters have ensured control in experiments that involve long tasks.

The data described above reflects the duration of explicitly measured software engineering-specific tasks as described in Section 7.1. Often, however, subjects perform additional tasks (training, preparation, postmortem questionnaires, etc.) whose durations are not captured in dependent variables or otherwise measured explicitly. If one wants to reflect the total time spent by subjects (perhaps in the interest of logistics), information at a different level must be gathered. Although most experiments (close to 80 percent) provide some sort of information about total experiment duration, the data is, in general, measured and reported arbitrarily and is consequently difficult to summarize here.

The median duration of the tasks of 1.0/2.0 hours is, of course, a very small fraction of the time of a typical industrial development project. The extent to which short tasks are a threat to external validity is difficult to judge in general. The actual tasks in the experiments may be representative of typical industrial (sub)tasks. However, the lack of studies that describe "typical" tasks within certain categories and contexts makes such a judgment difficult. More studies are needed to investigate the relevance of the tasks being conducted in software engineering experiments.

7.3 Magnitude of Experiments: Combination of Number of Subjects and Duration

Many aspects of the complexity of software engineering only manifest themselves in controlled experiments if the experiments involve a sufficiently large number of subjects and tasks, for example, differences among subgroups of subjects [3]. Hence, we can characterize the experiments in terms of the scale of the combination of subjects and tasks (here, measured in terms of the duration of the task). The magnitude of an experiment can be described in terms of the total number of person-hours or person-days, that is, the number of subjects multiplied by the length of the tasks. In this survey, the experiment with the largest number of professionals lasts less than one hour. However, in general, there seems to be no significant relationship between duration and the number of subjects.

We here categorize the 69 experiments with duration data (41 with subject-level data and 28 with slot-level data) according to subject numbers and task duration into, respectively, S (small), M (medium), and L (large), such that each category contains roughly 1/3 of the experiments. In practice, this gives the following categories: for subject numbers, S: fewer than 23, M: 23-47, and L: more than 47; for duration, S: less than 0.96 hours, M: 0.96-2.9 hours, and L: more than 2.9 hours. (The subject groups cannot be made completely even because there are six experiments with 24 subjects.) The person-hours categorization is obtained by crossing these two categorizations in configurations (subject category, duration category) as follows: S (small): (S,S), (S,M), (M,S); M (medium): (S,L), (M,M), (L,S); L (large): (M,L), (L,L), (L,M). Table 11 shows that experiments with professionals use a smaller number of subjects than do experiments with students. Both experiments with students and experiments with professionals have a uniform distribution over the three levels of duration. Regarding magnitude, most student experiments are in the middle category and a fair number are large, while most experiments with professionals are small and only one experiment is large.
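The crossing rule is mechanical enough to state as code. A minimal sketch (the tercile cut-offs are those given above; how exact boundary values such as 47 subjects or 2.9 hours are binned is our assumption, since the paper states only the ranges):

def tercile(value, low, high):
    # S/M/L classification given the survey's cut-offs; the treatment of
    # boundary values is our choice.
    if value < low:
        return "S"
    return "M" if value <= high else "L"

# Crossing rule from Section 7.3: (subject class, duration class) -> magnitude.
MAGNITUDE = {
    ("S", "S"): "S", ("S", "M"): "S", ("M", "S"): "S",
    ("S", "L"): "M", ("M", "M"): "M", ("L", "S"): "M",
    ("M", "L"): "L", ("L", "L"): "L", ("L", "M"): "L",
}

def magnitude(num_subjects, duration_hours):
    return MAGNITUDE[(tercile(num_subjects, 23, 47),
                      tercile(duration_hours, 0.96, 2.9))]

# Example: the experiment with 68 professionals lasting under one hour
# is (L, S), i.e., a medium-magnitude experiment.
print(magnitude(68, 0.9))  # -> "M"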
7.4 Application and Materials

Applications may be of various types, such as commercial, open source, the result of a student project, or custom-built for the purpose of the experiment. Table 12 shows that 75 percent of the surveyed experiments involved applications that were either constructed for the purpose of the experiment or were parts of student projects.

TABLE 12. Distribution of Experiments to Application Type

Commercial applications were used in 16 experiments, of which 10 included inspection tasks (eight of these had team inspections in addition to individual inspections), two included design tasks, one had coding and maintenance (change code), one had coding only, one had
(design) comprehension and maintenance (change design), and one had estimation. For 12 experiments, the reporting is unclear in this respect, but 11 of these appear to have used custom-built applications. There are no open-source applications registered in this survey. The small fraction of commercial or industrial applications used in current software engineering experiments puts in question the possibility of generalizing the experimental results to an industrial setting.

The size of the materials presented to subjects gives some indication of the comprehensiveness of the experiments. Size in the form of pages, lines of code (LOC), or other quantities is reported in 67 (59 percent) of the experiments. The diversity of the surveyed experiments, and of how they report information about materials, makes it difficult to give a systematic overview of the size of the experiment materials. Nevertheless, below, we describe in brief the size of materials per task category; cf. the rightmost column of Table 10.

Three experiments with Project planning tasks report materials size: a one-page case scenario, a scenario in terms of 500 adjusted function points, and a four-page program representation, respectively. The three Estimation experiments are based on a 1,000 person-hour project, a four-page program representation, and the creation of 10 programs (no sizes on these), respectively.

Five of the six experiments with size indications for Design tasks are on requirements documents (1-2 pages, six modules). The sixth experiment gives a one-page task description. Materials size is reported in one instance for a coding task (specification given in three tables).

Two experiments with Change design tasks have a 30-page design document as material (one experiment is an internal replication of the other; the materials of the former are improved, but their sizes are more or less the same), and one has a 1,500 LOC system as input. In the experiments with Change code tasks, the applications to be maintained range from 54 to 2,700 LOC. The largest application also involves 100 pages of documentation. Three experiments report the number of classes (6-10).

Twenty-eight experiments give materials size for Inspection tasks (individual or team). Fourteen give LOC (ranging from 135 to 3,955 LOC). In one instance, the materials size is given as 300 LOC, but the size of the entire system is 65,000 LOC. Page counts (ranging from 16 to 47 pages) are given in 14 instances (all different from the 14 with LOC).

Materials size for Testing tasks (25-2,000 LOC) is reported in seven experiments (one also reports 10 classes). Reported materials sizes for Document comprehension tasks are varied (five diagrams, seven screens, 16 screenshots, etc.), but five experiments give LOC (92-2,700 LOC) for Code comprehension tasks and five experiments give page counts (2-30 pages) for Design comprehension tasks.

In addition, some experiments (with tasks in the Create category) report the size of produced task solutions. Five experiments with Coding give LOC (in the range 86-2,000 LOC) for produced code, and in one experiment the size for a Design task is provided implicitly, in that the solution design document written by the experimenters is two pages. Also, the amount of added code is given for two maintenance tasks: 50-150 LOC and 35-79 LOC, respectively.

8 ENVIRONMENTS

The strength of controlled experiments lies in that they may be used to isolate causal relationships. However, controlled experiments in the field of software engineering are often conducted in artificially designed environments that make it difficult to generalize the results to industrial contexts. In short, "Internal and external validity can be negatively related" [37]. This section describes the surveyed experiments according to their location and the tools used.

8.1 Location

There is a trade-off between realism and control regarding the location of an experiment. Running an experiment in the usual office environment of subjects that are professionals allows a certain amount of realism, yet increases the threat to internal validity due to breaks, phone calls, and other interruptions. Controlling and monitoring the experiment is easier in a laboratory setup but, in such a setting, realism suffers.

For the 27 experiments with professionals or with professionals and students, 17 report no explicit information about the experimental setting. Only one experiment is reported to have been run in a usual office environment. The pilot of another experiment was run in an office environment, but the main experiment was run in a classroom setting in order to increase the internal validity. Three other experiments were run in a classroom setting, two of which were run as part of a training course. Seven experiments are explicitly reported to have been run in a laboratory environment.

Of the 85 experiments with students or with students and scientists, 56 report no explicit information about the experimental setting. For, respectively, 13 and seven of those experiments, it was explicitly stated that they were conducted in a laboratory or a classroom. For another group of nine experiments, some sort of university setting was stated, for example, "academic settings," "conducted under exam conditions," and "supervised setting." However, one may assume that all the experiments with students were carried out in a laboratory or classroom. Moreover, we believe that the distinction between a classroom and a laboratory setting for students may be blurred and may depend on cultural differences, apart from the fact that a laboratory usually would include the use of PCs or workstations (2/3 of the experiments that report the use of a laboratory also report the use of a PC or workstation; see the next section).

Approximately half of all the experiments with students report the name of the actual university/college. For the 27 experiments that include professionals, the name of the company is reported in 12 cases. For four experiments, the company is not named, but the type of company is specified. Note that, depending on the actual experiment, certain companies have a policy such that they must remain anonymous in the reporting of the experiment. In five cases,
  • 14. 746 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 31, NO. 9, SEPTEMBER 2005 TABLE 13 In general, increasing the realism of software engineering Distribution of Experiments to Specified Tool experiments entails an increased use of industrial support- ing tools. The community should thus recognize the effort and resources needed to set up PC or workstation environments with the right licenses, installations, access rights, etc., and to familiarize the subjects with the tools. Moreover, the tools must be checked to demonstrate acceptable performance and stability when many subjects are working simultaneously. In the experiments of this survey, there is almost no the professionals are described as coming from “several” discussion of the relationships among the three dimensions subject, task, and environment. For the community to companies or organizations. The exact number of compa- progress, this issue needs to be investigated. For example, nies is not given.6 a professional development tool will probably become more 8.2 Tools useful the larger and more complex the tasks and It is a challenge to configure the experimental environment application systems become, assuming that the subjects with an infrastructure with supporting tools that resembles are sufficiently proficient with the tool. an industrial development environment. Among the sur- veyed experiments, 55 percent report on tools to support 9 REPLICATION the tasks of the experiments (Table 13). This includes both In this survey, 20 of the experiments are described by the explicit descriptions, e.g., “Sun-4, GNU C compiler,” and authors themselves as replications. These experiments implicit, but clear indications, e.g., “Developed programs were constitute 14 series of replications. Table 14 summarizes run against a set of test data.” Table 13 shows that the use of the series including both the original experiments and the computer tools is slightly higher than the use of pen and replications and reports differences between them. Most paper. However, it is likely that a larger proportion of those replications (35 percent) are conducted in the area of experiments that do not report on tools are actually pen and Inspection (seven replications in series 1, 2, and 3) and paper experiments, because the added effort and adminis- Maintenance (five replications in series 4, 5, and 6). Among trative overhead of using computer tools might inspire the 20 replications, five can be considered as close replica- researchers to report the use of tools more than the use of tions in the terminology of Lindsay and Ehrenberg [31], i.e., pen and paper. one attempts to retain, as much as is possible, most of the The task types that give the largest and smallest known conditions of the original experiment. The other contribution to the PC or workstation category are, respec- replications are considered to be differentiated replications, tively, Coding and Inspection. Other than that, there is little i.e., they involve variations in essential aspects of the correlation between task type and tool for the experiments experimental conditions. One prominent variation involves that actually report on this issue. Moreover, there was no conducting the experiment with other kinds of subject; difference between experiments with professionals and three replications use professionals instead of students, experiments with students regarding the use of tools. 
Three of the five experiments with Combination in Table 13 explicitly test the effects of computerized tool use versus pen and paper.

The relatively meager proportion of experiments that report on the use of tools to support the assigned tasks may be due to an unawareness of, or a lack of interest in, the relevance of this issue. For example, most of the experiments in the Unknown category are inspection experiments, for which it may be normal to use pen and paper. However, for most design, coding, testing, and maintenance tasks, computer tools would have been used in an industrial setting, although the line is not clear-cut: designers may sketch preliminary versions by hand, for example, but the final design would be made using a tool.

In general, increasing the realism of software engineering experiments entails an increased use of industrial supporting tools. The community should thus recognize the effort and resources needed to set up PC or workstation environments with the right licenses, installations, access rights, etc., and to familiarize the subjects with the tools. Moreover, the tools must be checked to demonstrate acceptable performance and stability when many subjects are working simultaneously.

In the experiments of this survey, there is almost no discussion of the relationships among the three dimensions of subject, task, and environment. For the community to progress, this issue needs to be investigated. For example, a professional development tool will probably become more useful the larger and more complex the tasks and application systems become, assuming that the subjects are sufficiently proficient with the tool.

9 REPLICATION
In this survey, 20 of the experiments are described by the authors themselves as replications. These experiments constitute 14 series of replications. Table 14 summarizes the series, including both the original experiments and the replications, and reports the differences between them.

TABLE 14
Replicated Experiments

Most replications are conducted in the areas of Inspection (35 percent; seven replications in series 1, 2, and 3) and Maintenance (five replications in series 4, 5, and 6). Among the 20 replications, five can be considered close replications in the terminology of Lindsay and Ehrenberg [31], i.e., one attempts to retain, as much as is possible, most of the known conditions of the original experiment. The other replications are considered to be differentiated replications, i.e., they involve variations in essential aspects of the experimental conditions. One prominent variation involves conducting the experiment with other kinds of subjects: three replications use professionals instead of students, three use undergraduates instead of graduates, and one uses students instead of professionals. Other variations include conducting the experiment on different application systems (four) and with different tasks (three).

In all five close replications, the results of the original experiment were confirmed (three were conducted by the same authors, two by others). Among the 15 differentiated replications, seven were conducted by other authors. Six of these reported results differing from those of the original experiment and one partly confirmed the results of the original experiment. Among the differentiated replications conducted by the original authors, we found the opposite pattern: seven replications confirmed the results of the original experiment and only one reported partly different results.

"Methodological authorities generally regard replication, or what is also referred to as 'repeating a study,' to be a crucial aspect of the scientific method" [31]. However, only 18 percent of the surveyed experiments were replications. A discussion of the form and extent of replication that would benefit software engineering is beyond the scope of this paper, but it should be an issue of discussion for the research community.
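The replication tallies above can be re-derived mechanically. The following Python sketch is our own illustration, not part of the original study; it encodes only the aggregate counts stated in the text and checks that they are mutually consistent.

# Each tuple: (kind of replication, who conducted it, outcome).
# The multiplicities are the counts reported in the text above.
replications = (
    3 * [("close", "same authors", "confirmed")]
    + 2 * [("close", "other authors", "confirmed")]
    + 6 * [("differentiated", "other authors", "differed")]
    + 1 * [("differentiated", "other authors", "partly confirmed")]
    + 7 * [("differentiated", "same authors", "confirmed")]
    + 1 * [("differentiated", "same authors", "partly differed")]
)
assert len(replications) == 20
assert sum(kind == "close" for kind, _, _ in replications) == 5
assert sum(kind == "differentiated" and who == "other authors"
           for kind, who, _ in replications) == 7
print(f"{len(replications) / 113:.0%} of the 113 experiments")  # -> 18%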
10 THREATS TO INTERNAL AND EXTERNAL VALIDITY
Two important aspects of the quality of an experiment are its internal and external validity. This section discusses how, and the extent to which, threats to internal and external validity are reported for the surveyed experiments. Descriptions of such threats are made in various ways and under different headings. For 54 experiments (48 percent of all experiments), there is a special section entitled "Threats to (internal/external) validity" or with another heading that includes the terms "threats" or "validity." Nine other experiments (8 percent) have special sections on threats to validity but with other names (e.g., "Limitations to the results"). For yet another eight experiments, the reporting of threats to validity was found in other sections.

10.1 Internal Validity
The internal validity of an experiment is "the validity of inferences about whether observed co-variation between A (the presumed treatment) and B (the presumed outcome) reflects a causal relationship from A to B as those variables were manipulated or measured" [37]. Changes in B may have causes other than, or alternative to, the manipulation of A. Such an alternative cause of the outcome is called a confound or confounding factor. For further discussion (including formal definitions) of the concepts of confounding, see [21].

Threats to internal validity are addressed explicitly for 71 experiments (63 percent). (We did not include threats that are addressed implicitly as part of the experimental design.) We classified the reporting of threats to internal validity according to the scheme of Shadish et al. [37] shown in Table 15. That table also shows examples of threats in the various categories reported for the surveyed experiments. A version of this scheme, with other examples from software engineering, is presented in [46].

Table 16 shows the distribution of the experiments according to the scheme of Table 15. Almost half of all experiments report on selection threats (46 percent) and/or instrumentation threats (40 percent). The distribution of the number of threat categories reported per experiment is as follows: 22 experiments report one threat category, 11 report two, 23 report three, 10 report four, four report five, and one experiment reports seven.

Table 16 also shows whether the authors consider the threats to be present but not handled, or to be reduced or eliminated, either through actions taken by the authors or owing to other circumstances. Only 18 percent of the threats (reported in 23 percent of the experiments) are not handled, but it may be the case that threats that are not reduced or eliminated are underreported.
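As a quick consistency check (our own addition, not from the paper), the distribution of reported threat categories sums to the 71 experiments that address internal validity explicitly:

# Number of experiments (values) that report a given number of
# threat categories (keys), as stated in the text above.
threat_category_counts = {1: 22, 2: 11, 3: 23, 4: 10, 5: 4, 7: 1}
assert sum(threat_category_counts.values()) == 71  # experiments reporting threats
# Total threat-category instances reported across those experiments:
print(sum(k * v for k, v in threat_category_counts.items()))  # -> 180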
TABLE 15
Threats to Internal Validity: Reasons Why Inferences that the Relationship between Two Variables Is Causal May Be Incorrect

Classifying threats to internal validity is not always straightforward. For example, "learning effects" are often classified as "maturation" in the experiments, whereas they should be classified as "testing" according to the categories given in Shadish et al. [37]: maturation threats refer to "natural changes that would occur even in the absence of treatment, such as growing older, hungrier, wiser, stronger, or more experienced," while testing threats refer to effects of practice and familiarity within the experiment that could be mistaken for treatment effects [37]. Moreover, threats that by this scheme pertain to statistical conclusion validity or construct validity were, for a few experiments, reported as internal validity threats. In part, this may be due to nontrivial subtleties in threat classification, illustrated by the fact that the line of development starting with Campbell et al. [9], via Cook et al. [11], to the present classification scheme in [37] shows considerable variation. For example, the notions of statistical conclusion validity and construct validity appeared for the first time in 1979 [11].7

7. In addition, there are threats that scholars put in different main categories. For example, what Trochim [44] and Wohlin et al. [46] refer to as "social threats" are categorized as threats to internal validity by them, but as threats to construct validity by Shadish et al. [37]. Four experiments in our survey address "social threats," but since we follow the scheme of Shadish et al., such threats are not included in our survey.

10.2 External Validity
External validity concerns inferences about the extent to which a causal relationship holds over variations in persons, settings, treatments, and outcomes [37]. This section summarizes how the authors report threats to external validity regarding these issues.

Threats to external validity are reported for 78 experiments (69 percent).
TABLE 16
Threats to Internal Validity

Table 17 shows a categorization of the threats based on Shadish et al. [37]. Threats regarding subjects are discussed in a total of 67 experiments (rows one, four, five, six, and seven), threats regarding task in 60, environment in 23, and treatment in six.

TABLE 17
Threats to External Validity

Most threats regarding subjects deal with the difficulty of generalizing from students to professionals (45 experiments). Another category of experiments (14) also uses students, but the experimenters argue that this may not be a threat to validity because the students would (probably) have the same ability as professionals for this kind of task (seven), because the students were close to finalizing their education and starting work in industry (four), or because one other study showed no difference between students and professionals (three). A few experiments (three) that use professionals claim that threats to external validity were not critical because the experiment was conducted with professionals (they did not discuss the representativeness of their actual sample of professionals). A few experiments (three) considered running an experiment within a single organization to be a threat to generalization to other organizations. Very few experiments (two) explicitly described the lack of random sampling as a threat to validity.

Most of the task-related threats concern the size and complexity of the tasks (16 experiments) and the experimental materials (34), such as program code, inspection documents, and database systems. For experiments on inspection, one threat discussed was that the inspection process applied was not considered representative of industrial practice (nine). The (short) duration of the experiment was also regarded as a threat (three). One experiment stated that "all our results were obtained from one project, in one application domain, using one language and environment, within one software organization. Therefore, we cannot claim that our conclusions have general applicability, until our work has been replicated." Another experiment stated that the subjects might not have used the technique intended to be studied in the experiment.

Threats regarding environment were either stated as a problem of generalizing from the experimental setting, with no specific reasons given (five experiments), or stated with concrete reasons for the difficulties: use of a laboratory or classroom (nine), individual work (five), and use of pen and paper (six).
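The subject-related counts above can be checked against the total of 67 experiments. The short Python snippet below is our own addition; the category labels are our paraphrases of the threats discussed above, and the mapping to the five rows of Table 17 is our reading of the text.

# Experiments discussing each subject-related threat (counts from the text).
subject_threats = {
    "generalizing from students to professionals": 45,
    "students used, but argued comparable to professionals": 14,
    "professionals used, threat deemed not critical": 3,
    "single organization limits generalization": 3,
    "lack of random sampling": 2,
}
# Together, these account for the 67 experiments that discuss
# subject-related threats (rows one, four, five, six, and seven of Table 17).
assert sum(subject_threats.values()) == 67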
A major finding is that the reporting is vague and unsystematic. The community needs guidelines that provide significant support for how to draw conclusions from experimental results and for how to address threats to internal and external validity and their consequences.

11 THREATS TO THE VALIDITY OF THIS SURVEY
The main threats to the validity of this study are publication selection bias, inaccuracy in data extraction, and misclassification.

11.1 Selection of Journals and Conferences
We consider the 12 surveyed journals and conferences to be leaders in software engineering in general and empirical software engineering in particular. (Our selection of journals is a superset of those selected by others, as shown in Table 1.) Nevertheless, a systematic survey that, in addition, included gray literature (theses, technical reports, working papers, etc.) describing controlled experiments in software engineering would, in principle, provide more data and allow more general conclusions to be drawn [29].

11.2 Selection of Articles
To help ensure an unbiased selection process, we defined the research questions in advance, organized the selection of articles as a multistage process, involved several researchers in this process, and documented the reasons for inclusion/exclusion, as suggested in [29].

The initial investigation of the titles and abstracts of the 5,453 articles resulted in 140 candidate articles for the survey. Based on recorded comments, 80 of these were reanalyzed by one or two other researchers and discussed in the project group. Seventeen further articles were then excluded because they described studies without a treatment. Moreover, three articles were found to be exploratory, observational, or to constitute a prestudy. Eight were found to fall outside the field of software engineering, five were excluded on the grounds that they were summary articles, and four articles described multiple case studies. We used Inspec and various search engines to check the completeness of our inclusion and cross-checked for inclusion against other surveys [51], [22], [25]. Still, the process was difficult, and we may not have managed to detect all the articles that we would have liked to include.

Another challenge was that there is no keyword standard that we are aware of that distinguishes between methods in empirical software engineering and that could be used to extract controlled experiments in a consistent manner. For example, none of the selected articles matched the IEEE keyword taxonomy; indeed, this taxonomy has no appropriate keywords for the methods of empirical software engineering. (MIS Quarterly has ceased to use its keyword classification scheme due to the presence of full-text search engines and the difficulty of keeping keyword classification schemes up to date [45].)

Moreover, article and experiment duplication is a potential threat to the frequency counts and statistics in this survey. Among the 113 experiments covered in the 103 articles, 109 are reported in one article, two are reported in two articles, one is reported in three articles, and one is reported in four. Among the 103 surveyed articles, 91 report a single experiment, seven report two experiments, and five report three experiments. We detected one case of near article duplicates in different journals. The structure of the database is designed to handle duplication, but a threat would be that duplication goes undetected. However, at least three people have read through all the relevant articles without detecting further duplicates.

11.3 Data Extraction
The data was extracted from the articles independently by two researchers. The interrater agreement varied from 73 percent to 100 percent. Disagreements were resolved by discussion and, when necessary, by involving other project members. Data extraction from prose is difficult at the outset, and the lack of standard terminology and of standards for reporting experiments in software engineering may have resulted in some inaccuracy in the data.

11.4 Classification to Topics
The classification of articles to topics was done in two steps. First, the articles were classified automatically on the basis of title, list of keywords, and registered treatment. Then, this classification was double-checked by two researchers. The interrater agreement between the algorithm and the two researchers was 75 percent for the comparative classification using Glass et al.'s scheme and 66 percent for the IEEE classification. The topic classification was difficult due to the lack of a well-defined method of classifying according to the schemes used.
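To make the agreement figures above concrete, here is a minimal Python sketch of how such interrater agreement can be computed. It is our illustration, not code from the study, and the example labels are hypothetical. The paper reports simple percent agreement; Cohen's kappa, which corrects for chance agreement, is shown as a common alternative.

from collections import Counter

def percent_agreement(rater_a, rater_b):
    # Fraction of items to which both raters assigned the same category.
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    # Agreement corrected for the agreement expected by chance,
    # estimated from each rater's marginal category frequencies.
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical topic labels assigned to six articles by two raters:
r1 = ["inspection", "oo_design", "testing", "inspection", "maintenance", "testing"]
r2 = ["inspection", "oo_design", "testing", "oo_design", "maintenance", "testing"]
print(percent_agreement(r1, r2))        # -> 0.833...
print(round(cohens_kappa(r1, r2), 2))   # -> 0.78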
12 SUMMARY
This paper reported a survey that quantitatively characterized the controlled experiments in software engineering published in nine journals and three conference proceedings in the decade from 1993 to 2002. Included were randomized experiments or quasi-experiments in which individuals or teams (the experimental units) applied a process, method, technique, language, or tool (the treatments) to conduct one or more software engineering tasks. Out of the 5,453 articles scanned, we identified 103 articles that reported 113 controlled experiments.

Although as many as 108 institutions from 19 countries were involved in conducting the experiments, a relatively low proportion of software engineering articles (1.9 percent) report controlled experiments, given that the controlled experiment is the classical scientific method for identifying cause-effect relationships. One reason may be the large effort and resources needed to run well-designed experiments.

An important issue that pertains to all software engineering research is its relevance to industry. For experiments, both the topics under investigation and how representative of an industrial setting an experiment is will influence its industrial relevance. The two major areas investigated in the experiments were inspection techniques and object-oriented design techniques. This survey also gave some indications as to how realistic the experiments were with respect to the subjects that took part, the tasks they performed, the types of applications on which these tasks were done, and the environment in which the subjects worked.
TABLE 18
Total Number of Articles Investigated

In total, 5,488 subjects participated in the experiments. The number of participants per experiment ranged from 4 to 266, with a mean value of 49. Overall, 87 percent of the subjects were students, whereas only 9 percent were professionals. This indicates that one may question how representative the experimental results are for an industrial setting.

The same applies to the kind of application used in the experiments. In 75 percent of the experiments, the applications were constructed for the purpose of the experiment or constituted student projects. Commercial applications were used in 14 percent of the experiments.

Threats to internal and external validity were addressed in, respectively, 63 percent and 69 percent of the experiments. Among the threats to internal validity, about 1/5 were not handled, 3/5 were reduced, and 1/5 were eliminated. This could mean either that the experiments overall had a high degree of internal validity or that the internal threats that were not reduced or eliminated were underreported. Threats to external validity regarding subject and task were discussed in more than half of the experiments, regarding environment in about 1/4 of the experiments, and regarding treatment in only a few. Threats to internal validity regarding selection and instrumentation were the most frequently reported.

A major finding of this survey is that the reporting is often vague and unsystematic and that there is often a lack of consistent terminology. The community needs guidelines that provide significant support on how to deal with the methodological and practical complexity of conducting and reporting high-quality, preferably realistic, software engineering experiments. We recommend that researchers accurately report the following: the type and number of subjects, including the mortality rate; context variables, such as general software engineering experience and experience specific to the tasks of the experiments; how the subjects were recruited; the application areas and types of tasks; the duration of the tasks; and the internal and external validity of the experiments, including being specific about the sample and target population of the experiment. A more uniform way of reporting experiments will help to improve the review of articles, the replication of experiments, meta-analysis, and theory building.

TABLE 19
Number of Articles that Report Controlled Experiments