A first introductory lecture on empirical methods in software engineering. It includes:
1) Motivation for empirical software engineering studies
2) How to define research questions
3) Measures and data collection methods
4) Formulating theories in software engineering
5) Software engineering research strategies
Find the videos at: https://www.youtube.com/playlist?list=PLSKM4VZcJjV-P3fFJYMu2OhlTjEr9Bjl0
Empirical Methods in Software Engineering - an Overview
1. Empirical Methods in
Software Engineering
Alessio Ferrari, ISTI-CNR, Pisa, Italy
alessio.ferrari@isti.cnr.it
April, 2020
2. What is Software
Engineering?
• Software engineering is the systematic design and
development of software products and the management
of the software process.
• Software engineering has as one of its primary objectives
the production of programs that meet specifications, are
demonstrably accurate, produced on time, and within
budget.
cf. D. O’Neil, 1980. https://doi.org/10.1147/sj.194.0421
3. The Scope of Software Engineering (SE)
[Diagram: customers/users, analysts, system designers, developers, testers and maintainers, connected through requirements, the software system, and the (apparently good) delivered software.]
4. Typical SE Problems
How can I find bugs in my code?
How can I improve software
development speed?
How can I reduce the resources
dedicated to testing?
How can I improve my
requirements?
5. Typical SE Solutions (when we were not Empirical)
How can I find bugs in my code?
How can I improve software
development speed?
How can I reduce the resources
dedicated to testing?
How can I improve my
requirements?
This new testing
environment will allow you
to find all the bugs
Let us use this
new prototypical programming
language
We can use a controlled language
Let’s use model checking
6. Typical SE Failures
How can I find bugs in my code?
How can I improve software
development speed?
How can I reduce the resources
dedicated to testing?
How can I improve my
requirements?
This new testing
environment will allow you
to find all the bugs
Let us use this
new prototypical programming
language
We can use a controlled language
Let’s use model checking
It is very complex! I need to re-
train all my team!
The language does not cover my
real cases!
The language does not allow me
to express what I want!
I need training! It takes too long!
Language is too strict!
7. The Software Engineer’s Illusion
We thought our tiny solutions would scale up to (ALL) real-world cases; we thought a successful simple example was sufficient to ensure that our idea was working…
We were convinced that we could smoothly pass from theory to practice
Problem → Solution
We thought we could change the world without knowing the world
8. The Hard Truth
• Of course, we were wrong…
• Software development is a complex, context-dependent phenomenon involving
multiple stakeholders, professionals, needs, technologies, domains (aircraft software,
mobile app to track your diet, enterprise software to manage workflow, you name it…)
• You rarely start the development from scratch (there may be legacy systems to
refactor, need to interact with external systems and databases)
• Even if the software to be developed is new, developers have specific backgrounds
and skills that have an impact on the development
• You rarely know how the project will go, as the context is surely going to change
throughout the project
• We made it all too simple; SE solutions rarely came from research, and as SE researchers we felt useless…
We understood that, to change the world,
we should first learn about the world
9. Empirical Software
Engineering Research
• The use of a (not “the”) scientific method to investigate software
engineering problems (there is no “official” scientific method)
• Start from observation, formulate hypothesis, select methodology,
validate hypothesis with respect to reality
• The simple idea is that if I understand how things work in practice, I can
find ways to improve them
• Knowledge and understanding are not regarded as a final goal, but as a
means to an end, where the end is solving real-world problems (within
the scope of software engineering)
• Scientifically evaluating whether my solution has solved the problem is
also empirical software engineering
10. Typical Empirical Software
Engineering Cycle
Observe Reality → Formulate Problem Theory → Evaluate Theory Against Reality → Formulate Solution Theory → Evaluate Solution Against Reality
(the formulation steps happen in the Theory Space, the observation and evaluation steps in the Reality Space)
11. Typical Empirical Software
Engineering Cycle
(cycle diagram as in slide 10)
I have a lot of bugs
in this software
People work hours and hours,
but still, lots of bugs
12. Typical Empirical Software
Engineering Cycle
(cycle diagram as in slide 10)
I have a lot of bugs
in this software
People work hours and hours,
but still, lots of bugs
bugs may be produced
by too many work hours
bugs may be associated with complex code
13. Typical Empirical Software
Engineering Cycle
(cycle diagram as in slide 10)
bugs may be produced
by too many work hours
I see that most bugs are
introduced
between 8pm and 8am
bugs may be associated with complex code
No relation with code
complexity
14. Typical Empirical Software
Engineering Cycle
(cycle diagram as in slide 10)
I see that most bugs are
introduced
between 8pm and 8am
No relation with code
complexity
Install a system
that prevents developers from
working at night
15. Typical Empirical Software
Engineering Cycle
(cycle diagram as in slide 10)
Install a system that prevents developers from working at night
I reduced the number of bugs!
16. Typical Empirical Software
Engineering Cycle
(cycle diagram as in slide 10)
I reduced the number of bugs!
I cannot meet
the delivery deadlines!
17. Typical Empirical Software
Engineering Cycle
(cycle diagram as in slide 10)
I cannot meet
the delivery deadlines!
Development speed may
be lower during the day
18. Typical Empirical Software
Engineering Cycle
(cycle diagram as in slide 10)
Development speed may
be lower during the day
More correct code
during the day, but
slower speed
More bugs during the
night, but faster speed
19. Typical Empirical Software
Engineering Cycle
(cycle diagram as in slide 10)
More correct code
during the day, but
slower speed
More bugs during the
night, but faster speed
Dedicate more testing
resources for code
developed at night
20. Typical Empirical Software
Engineering Cycle
(cycle diagram as in slide 10)
Dedicate more testing resources for code developed at night
Fewer bugs
More software
21. Typical SE Problems
How can I find bugs in my code?
How can I improve software
development speed?
How can I reduce the resources
dedicated to testing?
How can I improve my
requirements?
Problems are the same…
22. Typical SE Solutions (Today)
The way of approaching problems
is different…
Let’s see how you write
requirements now
Let’s see what the main quality problems of your
requirements are
Solutions are more context-specific,
start from reality,
and use a scientific method
How can I improve my
requirements?
I see your requirements language is clear, but the requirements are just incomplete! How many meetings do you normally have with your customer?
… Ok, let’s try to
schedule a meeting each week to
revise the requirements with the
customer
But WHICH method?
23. Software Engineering as a
Strange Creature
It has a technical facet, but also a human and social facet
24. Empirical Methods in
Software Engineering
These methods come from the hard sciences (mostly quantitative), but also from the social sciences (mostly qualitative):
• experiment with human subjects
• experiment with software subjects
• survey
• case study
• field study
• judgment study
• interview and ethnography
• literature review / archival analysis
25. Goal and Scope
of this Course
• To learn a set of methods commonly used in empirical
software engineering research
• To learn when to use a certain scientific method
• To learn how to combine different methods
Remember that all methods are FLAWED!
26. All Methods Are Flawed!
(from Steve Easterbrook)
• Experiments
• Real-world environment is simplified, as I have to focus only on a specific set of
variables (independent, dependent, controlled variables, we will see them later)
• Surveys
• People tell you what they think, not what they do, and it is hard to be sure that they
have correctly understood the questions
• Interviews and Ethnography
• Unavoidable researcher bias, as theories are derived from qualitative data
• Case Studies (which could include (quasi-)experiments, surveys, interviews, etc.)
• Hard to generalise and hard to separate environment from unit of analysis, as case
studies are real-world experiences (several confounding variables) in one or a few
companies (limited scope)
http://www.cs.toronto.edu/~sme/CSC2130/
Never stick to methodological purity!
WARNING: there is no acknowledged taxonomy for these methods!
27. Course Outline
• Overview of Empirical Methods
• Interviews and Ethnography
• Surveys
• Systematic Literature Reviews
• Qualitative Data Analysis Methods
• Experiments, Quasi-experiments and Hypothesis Testing
• Mining Software Repositories
• Case Studies and Action Research
29. Roles and Tasks in
Software Engineering
These are the things that you will study
as an Empirical Software Engineer
30. Roles in SE
• Board of Directors: A group of people, elected by stockholders, to establish corporate policies, and make
management decisions (can also be a single person in case the co)
• Managers: three different levels of management may be present in a large company (low, middle, top)
• Top-level managers are responsible for controlling and overseeing the entire organization.
• Middle-level managers are responsible for executing organizational plans which comply with the company’s
policies. These managers act as an intermediary between top-level management and low-level management.
• Low-level managers focus on controlling and directing. They serve as role models for the employees they
supervise.
• Customers: the ones who buy the system
• Users: the ones who use the system
• Requirements/Business Analysts: the ones that gather requirements from customers and users
• Designers and Architects: the ones that design the system at the high level
• Developers: the ones who code
• Testers: the ones who test the code
31. Roles in SE
(roles list as in slide 30)
Companies may include only a subset of the roles
32. Roles in SE
(roles list as in slide 30)
Companies may include only a subset of the roles
Some roles may be covered by the same person
33. Roles in SE
(roles list as in slide 30)
The roles may depend on the adopted software process!
Companies may include only a subset of the roles
Some roles may be covered by the same person
34. (Main) Tasks in Software Engineering
• Requirements Elicitation and Analysis
• Software Architecture
• Software Development
• Software Testing
• Software Documentation
• Software Maintenance
• Software Process Management
36. Research Question(s)
• Every research endeavour starts with a question about the world: a problem to solve, a curiosity about some observed fact (subconsciously related to something relevant that you may not always be able to articulate, e.g., why do developers prefer to work at night? —why are you asking this question? Because it’s interesting, but why is it so?), or a curiosity about some unknown fact (what are the most frequent defects in open-source code?)
• The research question is the inquiry that guides your research:
• e.g., What are the most frequent defects in code developed by people with less than 6 months of experience? What are the most frequent defects in code developed by people with 6 months to 3 years of experience? […]
• You normally structure your research and reporting according to one or more research questions: they help to clarify your GOAL to the reader but also TO YOU
• If you have more than one research question, it is good to establish a general research question (or research objective):
• e.g., (mainly considering HOW aspects) To what extent are certain defect types related to the degree of experience of the developer?
• e.g., (a more general one, which may also include WHY aspects) What is the relationship between defect types and the degree of experience of the developer?
37. Research Question(s)
(slide content as in slide 36)
Many times a clear formulation of the general research question
comes AFTER the formulation of the more specific research questions
38. Research Question(s)
(slide content as in slide 37)
TIP: sometimes you can formulate the general research question as a Research Objective, e.g.: Understanding to what extent certain defect types are related to the degree of experience of a developer
39. Types of Research Questions (from Robert Feldt)
Research Questions (RQs)
• Solution-focused: Creating, Refining
• Knowledge-focused:
• Exploratory: Existence, Descriptive, Comparative
• Base-rate: Frequency, Process
• Relationship: Existence, Causality (→ Comparative, Context)
http://www.robertfeldt.net/advice/guide_to_creating_research_questions.pdf
40. Types of Research Questions (from Robert Feldt)
(RQ taxonomy as in slide 39)
if not much is known about
the phenomenon under study,
we want to create tentative theories,
and give some evidence that a certain phenomenon
can be measured
(e.g., To which extent do developers get tired of coding?)
41. Types of Research Questions (from Robert Feldt)
(RQ taxonomy as in slide 39)
describe when and how the phenomenon
under study appears (normal patterns), when we already
have a well-defined problem and context
(e.g., When do developers get tired of coding?)
42. Types of Research Questions (from Robert Feldt)
(RQ taxonomy as in slide 39)
describe how the phenomenon
under study relates to other phenomena
(e.g., Why do developers get tired of coding?)
43. Types of Research Questions (from Robert Feldt)
(RQ taxonomy as in slide 39)
describe better ways to solve a problem or improve a situation
(e.g., Which strategies help to achieve X? How can we refine S to achieve X in a better way?)
44. Sub-Types of RQs and Examples
• Exploratory/Existence: “Does X exist?”, “Is Y something that software engineers really do?”
• Exploratory/Descriptive: “What is X like?”, “What are its properties/attributes?”, “How can we categorize/measure X?”, “What are the components of X?”
• Exploratory/Comparative: “How does X differ from Y?”
• Base-rate/Frequency: “How often does X occur?”, “What is an average amount of X?”
• Base-rate/Process: “How does X normally work?”, “What is the process by which X happens?”, “In what sequence do the events of X occur?”
• Relationship/Existence: “Are X and Y related?”, “Do occurrences of X correlate with Y?”, “What correlates with X?”
• Relationship/Causality: “What causes X?”, “Does X cause Y?”, “Does X prevent Y?”
• Causality/Comparative: “Does X cause more Y than Z does?”, “Is X better at preventing Y than Z is?”
• Causality/Context: “Does X cause more Y under one condition than under others?”
45. Creating Research Questions
• Select overarching research topic (e.g., software development speed)
• Do you want to create more and better understanding (Knowledge-based), or are you seeking a solution to a problem (Solution-based)?
• (Knowledge-based) e.g., what affects development speed?
how can we measure development speed?
• How much is known about the topic?
• Not much (Explorative): how can we measure development speed?
• We know the phenomenon, but not how or when it occurs (Base-rate): what is the
development speed of agile teams?
• We know the phenomenon, but not its causes (Relationship): what affects development
speed?
• (Solution-based) e.g., how can I improve development speed? what is the easiest way to improve
development speed?
• Often a single research question is not sufficient and you need a combination of them, so try to find a main research question or objective, and identify sub-questions, e.g., by checking the types of questions in the previous table and adapting them to your problem
46. Data Types,
Measure, Scales
cf. Wohlin et al., 2012, https://doi.org/10.1007/978-3-642-29044-2
Alessio Ferrari, ISTI-CNR, Pisa, Italy
alessio.ferrari@isti.cnr.it
47. Empirical Methods in
Software Engineering
(methods diagram as in slide 24)
48. Empirical Methods in
Software Engineering
(methods diagram as in slide 24)
Empirical inquiries entail OBSERVATION
49. Empirical Methods in
Software Engineering
(methods diagram as in slide 24)
Empirical inquiries entail OBSERVATION
Regardless of the research method
you use, you will need to collect
and analyse data
50. Qualitative and Quantitative
Types of Data
• As you are doing empirical research, you will need to
collect data, regardless of the method you use
• Qualitative data (aka WORDS) come from interviews and surveys, but also from other sources that may be relevant for SE, such as social-media opinions, code comments, and app reviews
• Quantitative data (aka NUMBERS) come from
measurements (also done on qualitative data)
51. Measure
• A MEASURE is a mapping from the attribute of an entity to a
measurement value, which can be numerical or categorical (a label)
• Entities are objects we can observe in the real world, and have attributes
• entity: source code;
• attribute: complexity;
• measure A: lines of code; value A: 1000
• measure B: evaluation made by user; value B: “very complex”
• The purpose of mapping the attributes into a measurement value is to characterize
and manipulate the attributes in a formal way.
• To be valid, the measure must not violate any necessary properties of the attribute it
measures and it must be a proper mathematical characterization of the attribute (if
code X is more complex than code Y, this should be reflected in the measure)
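The mapping idea above can be sketched in a few lines of Python (a hypothetical illustration: the function names and the thresholds for the categorical measure are invented, not part of the lecture):

```python
# Sketch: a measure maps an entity's attribute to a measurement value.
# Measure A: lines of code (numerical); Measure B: a user judgement (categorical).

def measure_loc(source_code: str) -> int:
    """Measure A: map the 'complexity' attribute to lines of code."""
    return len(source_code.splitlines())

def measure_user_rating(loc: int) -> str:
    """Measure B: a categorical label; thresholds are invented for illustration."""
    if loc < 100:
        return "simple"
    if loc < 1000:
        return "complex"
    return "very complex"

entity = "\n".join(f"line {i}" for i in range(1000))  # a 1000-line 'source file'
value_a = measure_loc(entity)            # numerical value: 1000
value_b = measure_user_rating(value_a)   # categorical value: "very complex"
```

Note that measure B only orders code from simple to very complex; its values cannot be meaningfully averaged, which is exactly why scale types matter.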
52. Scale
• A mapping of an attribute to a measurement value can be
done in different ways, and each way is a scale
• Complexity can be measured in lines of code (LOC) or in
“evaluation made by user”, these are different scales
[Diagram: an Entity (Source Code) has an Attribute (Complexity); a Measure, on a given Scale (LOC), maps the attribute to a Measurement Value (1000).]
53. Scale Types (Level of Measurement)
• Nominal (named values): maps the attribute of the entity into a name or symbol; can be
seen as a form of classification of the attribute (e.g., types of code defects)
• Ordinal (named and ordered values): the ordinal scale ranks the entities after an
ordering criterion (“greater than”, “better than”, and “more complex”), (e.g., catastrophic,
critical, marginal, negligible risk)
• Interval (named, ordered and proportionate intervals): the interval scale is used when
the difference between two measures are meaningful, but the value itself is not
meaningful.
• This scale type orders the values in the same way as the ordinal scale but there is a
notion of “relative distance” between two entities
• Rare in SE; temperature in Celsius is a typical interval scale, but you can set up a
scale like IQ (Intelligence Quotient) in SE too (e.g., a usability scale based on a test)
• Ratio (named, ordered, proportionate intervals, have a meaningful zero): if there
exists a meaningful zero value (negative values do not exist) and the ratio between two
measures is meaningful (e.g., lines of code is a ratio scale)
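The four scale types can be summarised as a small lookup of which operations are meaningful at each level (a hypothetical sketch, not a standard API):

```python
# Sketch (invented helper): which operations are meaningful at each
# level of measurement.
ALLOWED = {
    "nominal":  {"=="},                   # only equality / classification
    "ordinal":  {"==", "<"},              # ordering too
    "interval": {"==", "<", "-"},         # differences are meaningful
    "ratio":    {"==", "<", "-", "/"},    # ratios and a true zero as well
}

def meaningful(scale: str, operation: str) -> bool:
    """Return True if the operation is meaningful on the given scale type."""
    return operation in ALLOWED[scale]

# Lines of code is a ratio scale: 1000 LOC really is twice 500 LOC.
assert meaningful("ratio", "/")
# Risk level (catastrophic > critical > marginal > negligible) is ordinal:
# ordering is fine, ratios are not.
assert meaningful("ordinal", "<") and not meaningful("ordinal", "/")
```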
54. Scales: Time and Duration
• (Clock) Time and Duration: what types of scale are they?
• Time is an interval scale
• Duration is a ratio scale
• Time is an interval measure when using any standard calendar and time
measurement system as there is no fixed start point
• 2018/10/23 20:10 CE and 2018/10/23 20:20 CE; there is a 10-minute gap, but the latter is not “twice” the former and there is no meaningful 0
• Duration (the amount of time something takes) is a ratio measure as it has a
meaningful zero
• 20 seconds is twice as long as 10 seconds and 10 days is twice as long as 5
days.
cf. https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/
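Python’s standard datetime types happen to mirror this distinction, which makes for a quick sanity check (the times below are arbitrary examples, not from the slides):

```python
from datetime import datetime, timedelta

# Clock time is an interval scale: differences are meaningful, ratios are not.
t1 = datetime(2020, 1, 1, 12, 0)
t2 = datetime(2020, 1, 1, 12, 10)
gap = t2 - t1                 # a 10-minute difference: perfectly meaningful
# t2 / t1 would be meaningless -- and Python does not even define it.

# Duration is a ratio scale: there is a true zero, so ratios make sense.
d1 = timedelta(seconds=10)
d2 = timedelta(seconds=20)
ratio = d2 / d1               # 20 s really is twice as long as 10 s
```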
55. Scale Types and Power
cf. https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/
Different scale types imply different allowed operations
56. Measure Types
• Objective: an objective measure is a measure where there is no judgement in the
measurement value and is therefore only dependent on the entity that is being
measured.
• An objective measure can be measured several times and by different researchers,
and the same value can be obtained within the measurement error.
• Subjective: a subjective measure is the opposite of the objective measure. The person
making the measurement contributes by making some sort of judgement. The measure
depends on both the entity and the viewpoint from which they are taken.
• A subjective measure can be different if the entity is measured again. A subjective
measure is mostly of nominal or ordinal scale type.
• Direct: does not involve measurements on other attributes (e.g., LOC).
• Indirect: is derived from the other measurements of other attributes, possibly involving
more than one entity (e.g., defect density, productivity).
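A minimal sketch of direct vs. indirect measures (all numbers are invented for illustration):

```python
# Direct measures: taken straight from the entity, no other attributes involved.
loc = 12_000          # lines of code
defects = 30          # defect count
person_months = 6     # effort

# Indirect measures: derived from measurements of other attributes.
defect_density = defects / (loc / 1000)   # defects per KLOC
productivity = loc / person_months        # LOC per person-month
```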
57. Measurements in SE
• In SE we normally measure three classes of entities
• PROCESS: The process describes which activities are needed to produce the software.
• PRODUCT: The products are the artifacts, code, deliverables or documents that result from a process activity.
• RESOURCES: Resources are the objects, such as
personnel, hardware, budget, needed for a process
activity.
58. Measurements in SE
• Relevant measures in SE are often indirect, subjective
and are normally expressed in nominal or ordinal scale
• Most of the time we want to link some internal attribute (e.g., code size, colours of the GUI) to an external one (e.g., perceived complexity, usability)
• In principle we should not apply advanced statistical analysis when we deal with these measures… however, we do it anyway (but we should always reflect on the risks and on the value of our conclusions)
cf. Briand et al. https://doi.org/10.1007/BF00125812
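One practical consequence: with ordinal data, rank-based statistics such as Spearman’s rank correlation are safer than parametric ones like Pearson’s. A minimal pure-Python sketch, assuming no tied values (real analyses should use a library implementation that handles ties):

```python
# Minimal Spearman rank correlation (assumes no tied values), a rank-based
# statistic that is safe to apply to ordinal data.
def ranks(xs):
    """Return the rank (1 = smallest) of each element of xs."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman's rho via the classic 1 - 6*sum(d^2)/(n*(n^2-1)) formula."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Perceived complexity (ordinal, coded 1..4) vs. code size (invented values):
complexity = [1, 2, 3, 4]
size = [120, 300, 900, 4000]
print(spearman(complexity, size))   # -> 1.0 (perfectly monotone relationship)
```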
60. Data Collection
• Before measuring you need to collect data that can be relevant to your research
questions
• Depending on the question, you may need different data collection techniques
• Normally, the data collection technique is also driven by the context that you CAN
access:
• are you in contact with a company and can you interview people? —interview
• are you in contact with the company and can you organise meetings with their developers? —focus group
• you do not have any direct contact with companies, but you can reach some
people? —questionnaire
• you need to compare the performance of different tools, which licenses do you
have? Can you buy them? —static and dynamic analysis of a system
Reaching the (RIGHT) source of information is one of the hardest parts…
61. Data Collection Techniques
cf. Lethbridge et al., 2005, https://doi.org/10.1007/s10664-005-1290-x
Table 1. Data collection techniques suitable for field studies of software engineering:
• First Degree (direct involvement of software engineers)
• Inquisitive techniques: Brainstorming and Focus Groups; Interviews; Questionnaires; Conceptual Modeling
• Observational techniques: Work Diaries; Think-aloud Protocols; Shadowing and Observation / Synchronized Shadowing; Participant Observation (Joining the Team)
• Second Degree (indirect involvement of software engineers): Instrumenting Systems; Fly on the Wall (Participants Taping Their Work)
• Third Degree (study of work artifacts only): Analysis of Electronic Databases of Work Performed; Analysis of Tool Use Logs; Documentation Analysis; Static and Dynamic Analysis of a System
62. Data Collection Techniques
Humans tend not to be reliable reporters, as they often do not remember past events
with a high degree of accuracy. Records of activities, such as tapes, work products, and
repositories, tend to be more reliable. However, care must be taken when interpreting
these data sources, as they may not be consistent, internally or with each other.
Despite their drawbacks, first degree techniques are invaluable because of their
flexibility and the range of phenomena they can be used to study. Existing logs and
repositories are easy to use, but the data available are highly constrained. Software
engineers, on the other hand, can be asked about a much wider range of topics.
Figure 1. Cost, reliability, flexibility, and phenomena addressed.
NO INVOLVEMENT,
e.g., analysis
of data logs
INDIRECT involvement
of people,
e.g., instrumentation
DIRECT involvement of people
e.g., brainstorming,
questionnaires
people intensive data intensive
64. Brainstorming
and Focus Groups
• What are they: based on a simple trigger question, people are free to express whatever comes to their
mind, initially on paper, and then take turns to speak.
• Advantages:
• new to a domain and seeking ideas for further exploration
• rapidly identifying what is important to the participant population
• sense of involvement in the research
• Disadvantages:
• can become unfocused
• hard to schedule with busy developers (you need to stop the activity of many people)
• Example: understanding factors leading to success and failure of software process improvement.
Researchers involved 13 software companies and ran 49 focus groups, each comprising
4 to 6 participants. Each session lasted 90 minutes. There were three types of
groups: senior managers, project managers, and developers. The focus groups were moderated and
tackled very specific questions aimed at understanding the factors leading to success and failure
in software process improvement.
65. Interviews
• What are they: ask a series of questions to some relevant actor of the software process
• Advantages:
• People are familiar with question-answering
• People tend to be happy when someone asks about them
• Create rapport with people
• Possibility to clarify
• Disadvantages:
• People are not always reliable, and this can bias the results
• Difficulties in sampling (random sampling often not applicable)
• Time consuming: scheduling, data transcription, etc.
• Example: researchers studied the design process used on 19 different projects at various organizations.
They interviewed personnel from three different levels of the participating projects: systems engineers,
senior software designers, and project managers. The researchers conducted 97 interviews, which
resulted in over 3000 pages of transcripts of the audio recordings.
66. Questionnaires and Surveys
• What are they: written pre-defined questions to be answered by people.
• Advantages:
• quick and easy to administer
• reach more people
• Disadvantages:
• difficult to clarify questions and answers
• return rates can be low (10% normally, 20% if you’re lucky)
• Example: paper-based questionnaire to identify factors affecting the adoption of a certain
tool in 52 organizations. The author contacted organizations that had purchased the
tool and surveyed key information systems personnel about its use.
Surveys and questionnaires are treated as synonyms here
67. Conceptual Modelling
• What is it: participants create a diagram of some aspect of their work, often a system architecture
or organisational structure or process. The intent is to bring to light their mental models.
• Advantages:
• easy to collect (drawing)
• can explain systems that are hard to understand otherwise
• Disadvantages:
• require domain knowledge to be interpreted
• can be hard to convince the engineers or other subjects to draw details
• Example:
• Identify the process in terms of tools, actors and tasks, for performing reimbursement of
expenses in a public administration office. The goal was to re-engineer the process. Interviews
with personnel to gather information, graphical diagrams shown to the personnel, and
validation of the diagrams.
Conceptual modelling requires interviews or focus groups
68. Work Diaries
• What are they: require participants to record various events that occur during the day. Filling out a
form at the end of the day, recording specific activities as they occur, or noting whatever the current
task is at a pre-selected time.
• Advantages:
• better self-reports of events, because activities are recorded on an ongoing basis rather than in
retrospect
• you can randomly sample work diary moments
• Disadvantages:
• you need to convince people
• can interfere with respondents as they work (recording can affect the work)
• people may neglect to record
• Example: I want to know the communication patterns (who do you contact, and about
what) in a company. I ask developers to record their communication patterns for a period of one
week. This allows identification of the interaction between team members, and of the typical
communication patterns of developers.
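The "noting the current task at a pre-selected time" variant can be supported by randomly sampling the prompt times in advance. A minimal sketch in Python; the function name, workday bounds, and number of prompts are invented for illustration:

```python
import random

def sample_diary_prompts(workday_start=9, workday_end=17, n_prompts=5, seed=42):
    """Randomly sample moments in a workday at which a participant is
    prompted to note their current task (experience sampling).

    Returns (hour, minute) tuples, sorted chronologically. A fixed seed
    makes the schedule reproducible across participants if desired.
    """
    rng = random.Random(seed)
    minutes_in_day = (workday_end - workday_start) * 60
    # Sample distinct minute offsets without replacement, then convert to clock time.
    offsets = rng.sample(range(minutes_in_day), n_prompts)
    return sorted((workday_start + m // 60, m % 60) for m in offsets)

print(sample_diary_prompts())
```

Random sampling of diary moments reduces the bias of self-selected recording times, at the price of occasionally interrupting the respondent mid-task.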
69. Think Aloud
• What are they: researchers ask participants to think out loud while performing a task.
As software engineers sometimes forget to verbalize, experimenters may occasionally
remind them to continue thinking out loud. Usually last no more than 2 hours.
• Advantages:
• One of the few ways to test a cognitive model
• Easy to implement
• You can also ask participants to write their thoughts down
• Disadvantages:
• Difficult and time consuming to analyse the output
• Example: I want to understand the strategies used by developers when debugging. I
give them a certain piece of software and ask them to add some functions; the system will return
an error, and I then ask them to debug the code and think aloud about what they do.
70. Shadowing/Observation
• What are they: with shadowing, the experimenter follows and observes one
participant and records their activities. With observation, the experimenter follows and
observes several participants at once (e.g., in meetings).
• Advantages:
• Easy to implement
• No special equipment needed
• Disadvantages:
• Captures only general, observable activity
• Need to know the environment and domain very well
• Can be annoying for people, and could bias their behaviour
• Example: I want to monitor informal communication in a group, so I observe an
open development space for a certain number of days.
71. Participant Observation (Join the Team)
• What is it: the researcher joins the development team and performs activities like the
others do.
• Advantages:
• More acceptance by the participants
• Deeper understanding of the dynamics
• Create rapport with people
• You can contribute to the team
• Disadvantages:
• Extremely time consuming (it’s an additional job)
• May lose external perspective
• Example: Over 17 months, a researcher participated in 23 code inspection meetings. From his
participation, he developed a series of hypotheses on how factors such as familiarity,
organizational distance, and physical distance are related to how much time is spent on
discussion and tasks.
72. 2nd Degree Techniques:
Indirect Involvement
• Instrumenting Systems
• Fly-on-the-Wall: participants recording their own work
The researcher needs to have contact
with the research environment and with the participants, but:
1. does not need to interact with them during data collection,
and 2. little effort is required from participants
73. Instrumenting Systems
• What is it: monitor developer-system interaction during a certain task, e.g., with eye tracking, cameras,
wristband, or add-on tools for logging.
• Advantages:
• No time commitment for software engineers (unless you carry out an experiment)
• Accurate information
• Disadvantages:
• Data are “raw” and do not have a clear meaning
• Ethical concerns in monitoring users
• Example: I want to monitor the degree of engagement of software developers in a company. I ask them
to use wristbands during their day to record their engagement (sensed engagement). I instrument their
computers with a logger to check what they are doing. I ask them to write down their degree of
engagement every 30 minutes (work diary, reported engagement). I check the extent to which the two
measures (sensed and reported engagement) are in agreement, and what the developers were doing.
74. Fly-on-the-Wall
• What is it: participants are required to record or videotape themselves while they
perform a specific task.
• Advantages:
• Little effort required from the participant
• No direct interaction with the researcher
• Disadvantages:
• High amount of data and high cost of analysing them
• Videos are multi-modal data, and analysing them is not straightforward
• Not always easy to understand the content of videos
• Example: I ask the team to videotape each meeting they hold for a certain period (e.g.,
an iteration). I review the recordings to look for specific patterns of interaction and the
roles of the people.
75. 3rd Degree Techniques
aka Mining Software Repositories
• Analysis of electronic databases of work performed /
Analysis of tool logs
• Document Analysis, such as code documentation and
other software-related documents
• Static and Dynamic Analysis of a system (Software
Analytics)
These require access only to work artefacts, such
as source code or documentation.
In recent years, with the development of shared repositories such
as GitHub, these data collection activities go under the name
Mining Software Repositories
77. Analysis of electronic database
of Work Performed and Tool Logs
• What is it: access to platforms for issue or bug reporting (e.g., Bugzilla), change requests, configuration
management systems, and version control systems (e.g., git)
• Advantages:
• Large amount of data
• Stable and independent of the researcher
• People do not need to do extra work
• Disadvantages:
• Too much data!
• Limited knowledge of the work environment
• People do not necessarily fill in all the information needed (e.g., in commit messages)
• Different companies have different process management policies, and this may impact the data
• Example: I want to understand the typical patterns of software evolution. I analyse the change
requests and commits in a certain software repository and check, e.g., when they are typically performed,
by whom, and whether there is a typical sequence of actions.
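A toy version of this pattern-mining idea, using only the Python standard library on synthetic "author|hour" commit records (a real study would parse actual git log output, or use a tool such as PyDriller, mentioned later in these slides):

```python
# Count commits per author and per hour-of-day from "git log"-style records.
# The records below are synthetic; in practice they could be produced with:
#   git log --pretty='%an|%ad' --date=format:'%H'
from collections import Counter

log_lines = [
    "alice|10", "alice|11", "bob|10", "alice|16",
    "bob|17", "carol|10", "alice|10",
]

by_author = Counter(line.split("|")[0] for line in log_lines)
by_hour = Counter(int(line.split("|")[1]) for line in log_lines)

print(by_author.most_common(1))  # most active committer
print(by_hour.most_common(1))    # hour of day with most commits
```

Even this trivial aggregation already answers "by whom" and "when"; sequences of actions would need the commit timestamps kept in order rather than counted.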
78. Document Analysis
• What is it: analysis of documents related to the software process, such as code comments,
e-mails, stack overflow, twitter, app review, developer’s documentation, users’ manual, etc.
• Advantages:
• Large amounts of data in natural language (English, Italian, German, etc.)
• Written information can answer "why" questions
• Independent of the researcher
• Disadvantages:
• Requires knowledge of the context
• Natural language processing (NLP) techniques are needed for large amounts of data
• Data are often “dirty”
• Example: I want to understand whether app reviews on the Apple Store actually contain
potential new requirements for the app. I ask some subjects to check a certain number of reviews
and identify requirements, and I check their agreement (I can also decide to automatically predict
whether a certain review includes a requirement or not, based on the manually checked reviews).
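The agreement between annotators mentioned above is often quantified with Cohen's kappa, which corrects raw agreement for chance. A self-contained sketch with invented binary labels (1 = "review contains a requirement", 0 = "it does not"):

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators with binary labels.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance, derived from each
    annotator's own label distribution.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both say 1, plus probability both say 0.
    pa1 = sum(labels_a) / n
    pb1 = sum(labels_b) / n
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    return (p_o - p_e) / (1 - p_e)

# Synthetic annotations of 8 app reviews by two subjects.
a = [1, 1, 0, 0, 1, 0, 1, 0]
b = [1, 0, 0, 0, 1, 0, 1, 1]
print(cohen_kappa(a, b))  # → 0.5
```

A kappa of 0.5 is only "moderate" agreement; in a real study one would refine the annotation guidelines and re-annotate before trusting the labels.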
79. Static and Dynamic Analysis of
a System (Software Analytics)
• What is it: analyze the code (static analysis) or traces generated by running the code (dynamic analysis) to
learn about the design, and indirectly about how software engineers think and work. One might compare the
programming or architectural styles of several software engineers by analyzing their use of various constructs,
or the values of various complexity metrics.
• Advantages:
• Large amounts of data
• Independent of the researcher
• Analysis tools are emerging (https://github.com/ishepard/pydriller, https://github.com/uni-bremen-agst/
libvcs4j, https://ghtorrent.org)
• Disadvantages:
• Source code is not always easy to understand
• Dynamic behaviour is even more difficult
• Need to resort to automated support
• Example: I want to find the most frequent dynamic errors triggered by software on GitHub.
I download a selection of representative projects, analyse them with an abstract interpretation tool, and
see what the typical errors are.
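A first taste of static analysis using Python's standard `ast` module: measuring the length in lines of each function in a snippet. The snippet is invented; a real study would run such a measurement over whole repositories, e.g., mined with the tools listed above:

```python
import ast

source = """
def short():
    return 1

def longer(x):
    y = x + 1
    z = y * 2
    return z
"""

tree = ast.parse(source)
# Function length in lines, from the "def" line to the last statement
# (end_lineno is available from Python 3.8 onwards).
lengths = {
    node.name: node.end_lineno - node.lineno + 1
    for node in ast.walk(tree)
    if isinstance(node, ast.FunctionDef)
}
print(lengths)  # → {'short': 2, 'longer': 4}
```

Metrics like these are exactly the kind of "empirical variable" that later slides discuss when operationalising constructs such as complexity.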
80. Data Collection Techniques: Summary
Table 2. Questions asked by software engineering researchers that can be answered by field study
techniques (technique | used by researchers to understand | volume of data | also used by software engineers for).
First Degree Techniques
• Brainstorming and Focus Groups | ideas and general background about the process and product, general opinions (also useful to enhance participant rapport) | small | requirements gathering, project planning
• Surveys | general information (including opinions) about process, product, personal knowledge, etc. | small to large | requirements and evaluation
• Conceptual modeling | mental models of product or process | small | requirements
• Work Diaries | time spent on or frequency of certain tasks (rough approximation, over days or weeks) | medium |
• Think-aloud sessions | mental models, goals, rationale and patterns of activities | medium to large | UI evaluation
• Shadowing and Observation | time spent on or frequency of tasks (intermittent over relatively short periods), patterns of activities, some goals and rationale | small | advanced approaches to use case or task analysis
• Participant observation (joining the team) | deep understanding, goals and rationale for actions, time spent or frequency over a long period | medium |
Second Degree Techniques
• Instrumenting systems | software usage over a long period, for many participants | large | software usage analysis
• Fly on the Wall | time spent intermittently in one location, patterns of activities (particularly collaboration) | medium |
Third Degree Techniques
• Analysis of work databases | long-term patterns relating to software evolution, faults, etc. | large | metrics gathering
• Analysis of tool use logs | details of tool usage | large |
• Documentation analysis | design and documentation practices, general understanding | medium | reverse engineering
• Static and dynamic analysis | design and programming practices, general understanding | large | program comprehension, metrics, testing, etc.
cf. Lethbridge et al., 2005, https://doi.org/10.1007/s10664-005-1290-x
81. Building Theories in
Software Engineering
cf. Sjøberg et al., 2009 https://dx.doi.org/10.1007/978-1-84800-044-5_12
cf. Mendez Fernandez, https://www.slideshare.net/mendezfe/an-introduction-into-
philosophy-of-science-for-software-engineers
Alessio Ferrari, ISTI-CNR, Pisa, Italy
alessio.ferrari@isti.cnr.it
82. What is a Theory?
• A statement about the existence of some pattern in the
entities that belong to a certain context
• The boundary of the context determines the scope of
applicability of the theory (e.g., all the people in a certain
company vs all the C developers of the world)
A theory exists where some form of REGULARITY can be identified
“entities” =
“observable phenomena” =
“events and objects”
83. What is a Theory?
• A theory can have different levels of sophistication, depending on how
abstract the considered entities are (how far they are from direct,
measurable observations):
• Low: 90% of faults are found in functions that are longer than 1000
LOC (once a definition of fault is given, this can be verified quite
precisely)
• Medium: Requirements defects can be classified into unclarity,
incompleteness and incorrectness (I need precise definitions for the
three classes, I have to assess that all the existing defects can be
linked to one of the classes, I have to check that every reader classifies
in the same manner…verification of the theory is complicated)
• High: If the team leader is not self-confident, developers lose trust
(I need measures for self-confidence and trust, verification of the
theory is VERY complicated)
In SE you will find all these different types of theory
84. What Does a Theory Do?
• Description: descriptions and conceptualisations (taxonomies, ontologies;
e.g., the defect types example)
• Explanation: identify the motivation (e.g., the team leader example)
• Prediction: predict according to a model (e.g., the fault example)
• Explanation and Prediction: find model and motivation
• Design and Action: prescriptive (e.g., the testing resources example from the
initial slides)
cf. also https://www.quora.com/How-can-statistics-tell-us-about-causality
85. What are the Elements
of a Theory?
• The elements of a theory can be framed according to 6 questions:
What, How, Why, Where, When, and for Whom.
• WHAT: What are the entities in terms of which a theory offers description, explanation,
prediction or prescription? These are the constructs of a theory.
• HOW: How are the constructs related? Relationships between constructs make up a
theory’s propositions, and describe how the constructs interact.
Can lead to predictions.
• WHY: Why do the relationships hold? Answers to this question
are what give the theory explanatory power.
• WHERE, WHEN, for WHOM (scope conditions): identify the circumstances in which
the theory is applicable (the context).
86. Constructs (WHAT)
Table 3. Constructs, propositions, example explanations and scope of the theory of UML-based
development.
Constructs
C1 UML-based development method
C2 Costs (total number of person-hours in the project)
C3 Communication (ease of discussing solutions within development teams and in reviews)
C4 Design (perceived structural properties of the code)
C5 Documentation (the documentation of the system for the purpose of passing reviews as
well as for expected future maintainability)
C6 Testability (more efficient development of test cases and better quality, i.e., better coverage)
C7 Training (training in the UML-based method before the start of the project)
C8 Coordination (of requirements and teams)
C9 Legacy code (code that has not been reverse engineered to UML models)
We will see more of this example later on
87. Propositions (HOW)
Propositions
P1 The use of a UML-based development method increases costs
P2 The use of a UML-based development method positively affects communication
P3 The use of a UML-based development method positively affects design
P4 The use of a UML-based development method positively affects documentation
P5 The use of a UML-based development method positively affects testability
P6 The positive effects of UML-based development are reduced if training is not sufficient
and adapted
P7 The positive effects of UML-based development are reduced if there is insufficient
coordination of modelling activities among distributed teams working on the same project
P8 The positive effects of UML-based development are reduced if the activity includes
modification of legacy code
88. Explanation (WHY)
Explanations
E4 The documentation is
– More complete
– More consistent, due to traceability among models and between models and code
– More readable, and makes it easier to find specific information, due to a common
format
– More understandable for non-technical people
– May be viewed from different perspectives, due to different types of diagram
E5 Test cases based on UML models
– Are easier to develop
– Can be developed earlier
– Are more complete
– Have a more unified format
Moreover, traceability from requirements to code and test cases makes it easier to
identify which test cases must be run after an update
89. Scope Conditions (WHEN,
WHERE, for WHOM…)
Propositions P6–P8 are examples of moderators. Scope conditions are
typically modelled as subclasses or component classes.
Scope
The theory is supposed to be applicable to distributed projects creating and modifying
large, embedded, safety-critical subsystems, based on legacy code or new code
This example theory answers all the questions,
but the theories you develop may answer
only a SUBSET of the questions
(e.g., WHY is left to other researchers)
91. How are Theories Formed?
Induction, Deduction, Abduction
• Induction: inference of a generalized conclusion from particular instances
(Observation → Theory)
• Deduction: derivation of testable hypotheses from a theory
(Theory → Hypothesis → Test)
• Abduction: generalize from theories (Theory → Theory)
97. Criteria for Evaluating Theories
…the presence of a falsifiable theory, which gives rise to hypotheses that are tested
by observation. Although this framework as such has been overtaken by other
frameworks (Ruse, 1995), the principle of testability remains fundamental for
empirically-based theories. There is no commonly agreed set of criteria for
evaluating theories.
Table 1. Criteria for evaluating theories
• Testability: the degree to which a theory is constructed such that empirical
refutation is possible
• Empirical support: the degree to which a theory is supported by empirical studies that
confirm its validity
• Explanatory power: the degree to which a theory accounts for and predicts all known
observations within its scope, is simple in that it has few ad hoc
assumptions, and relates to that which is already well understood
• Parsimony: the degree to which a theory is economically constructed with a
minimum of concepts and propositions
• Generality: the breadth of the scope of a theory and the degree to which the theory
is independent of specific settings
• Utility: the degree to which a theory supports the relevant areas of the software
industry
cf. Sjøberg et al., 2009 https://dx.doi.org/10.1007/978-1-84800-044-5_12
To what extent does my theory explain WHY?
98. Step-by-Step Guide to
Formulating Theories (Deductive)
1. Define the constructs of the theory (they can be novel constructs,
existing ones, or refinements of existing ones)
2. Define the propositions (novel, or modifications/refinements of
existing ones)
3. Provide explanations to justify the theory (explicit assumptions
and logical justifications for the constructs and propositions of
the theory, referring to existing theories, also from other
disciplines)
4. Define the scope of interest (values of constructs, and
combinations thereof, that the theory is oriented to explain)
Every time you apply an empirical method,
you are actually building theories
100. Step-by-Step guide to
Formulating Theories (Deductive)
5. Test the theory through empirical research (examination of the validity of the theory’s
predictions through empirical studies):
1. Choose an appropriate research setting and sample. The sample includes not
only the actors, but also the technologies, activities (tasks) and systems.
2. Operationalize theoretical constructs into empirical variables (e.g., justify the
connection between the complexity of software and its measure in lines of code)
3. Operationalize theoretical propositions into empirically testable hypotheses
(definition of hypotheses in terms of empirical variables)
4. Apply qualitative or quantitative methods to test the hypotheses
(when speaking about hypothesis testing, we normally refer to quantitative
statistical tests, but the conceptual process is the same for qualitative
methods)
6. Define the scope of validity (the part of the scope of interest in which the theory has
actually been validated)
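Step 4 (testing the hypotheses) can be made concrete with a tiny quantitative example: a permutation test for the hypothetical proposition "long functions contain more faults than short ones". All data and the function name are invented for illustration; real studies would use an established statistics package:

```python
import random

def permutation_test(sample_a, sample_b, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the estimated p-value: the fraction of random relabelings of the
    pooled data whose absolute mean difference is at least as large as the
    observed one. Under the null hypothesis the labels are exchangeable.
    """
    rng = random.Random(seed)
    observed = abs(sum(sample_a) / len(sample_a) - sum(sample_b) / len(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            hits += 1
    return hits / n_perm

# Synthetic fault counts for "long" vs "short" functions (invented data).
faults_long = [5, 7, 6, 8, 9, 7]
faults_short = [2, 3, 1, 2, 4, 3]
p = permutation_test(faults_long, faults_short)
print(p < 0.05)  # reject the null hypothesis at the 5% level?
```

Note how the proposition ("long functions are more fault-prone") had to be operationalised into empirical variables (fault counts per function) and a testable hypothesis (difference in means) before any statistics could be applied.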
101. Step-by-Step Graphical Guide (Deductive)
Theory → Operationalisation (Variables, Hypotheses, and Sample Definition)
→ Data Collection and Measurements → Data Analysis (Hypothesis Testing)
→ Confirm/Reject and Scope of Validity
113. Scope of Validity
The first consideration to make in testing a theory is to ensure that the study fits
the theory’s scope of interest; otherwise, the results would be irrelevant to that
theory. Moreover, in a given study, typically only a part of the scope of interest is covered.
Fig. 4 Scope of interest versus scope of validity
cf. Sjøberg et al., 2009 https://dx.doi.org/10.1007/978-1-84800-044-5_12
114. Threats to Validity
• Consistency (or inconsistency) between theoretical
propositions and empirical observations does not necessarily imply that
the theory is validated (or disconfirmed)
• Judgements regarding the validity of the theory require that the study
is well conducted, and not encumbered with:
• Invalid operationalization of theoretical constructs and propositions
• Inappropriate research design
• Inaccuracy in data collection and data analysis
• Misinterpretation of empirical findings
115. Threats to Validity
• Construct: have I operationalised all the constructs correctly? (e.g., is LOC a good scale to measure complexity?)
• Internal: are there aspects that may have influenced my outcome and that I did not consider? (identify confounding variables, e.g., had the people already seen the code they are evaluating?)
• External: to what extent are my findings generalisable? (how much of the scope of interest is covered, e.g., which types of languages are considered?)
Each research method has specific classifications of threats to validity, which we will see later; here are three general notions.
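The LOC question above can be made concrete. In this sketch, two hypothetical snippets have identical LOC but different control-flow structure; `branch_count` is a deliberately crude, illustrative proxy for complexity (not a real metric), invented here to show why LOC alone can be a weak operationalisation of the "complexity" construct.

```python
# Two hypothetical snippets with identical LOC but different control flow.
flat = """x = 1
y = 2
z = 3
w = 4"""

branchy = """if x:
    if y:
        z = 3
w = 4"""

def loc(source):
    # Lines of code: count non-empty lines.
    return sum(1 for line in source.splitlines() if line.strip())

def branch_count(source):
    # Crude proxy for control-flow complexity: count branching keywords.
    keywords = {"if", "elif", "while", "for"}
    return sum(1 for line in source.splitlines()
               if line.strip().split(" ")[0].rstrip(":") in keywords)

print(loc(flat), branch_count(flat))        # → 4 0
print(loc(branchy), branch_count(branchy))  # → 4 2
```

Both snippets score 4 LOC, yet only one contains any branching: a measure that cannot separate them is a construct-validity threat for any claim about complexity.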
116. Step-by-Step Graphical Guide (Inductive)
Sample Definition, Data Collection → Operationalisation → Theory
Often I do not have the information to identify constructs, propositions, and explanations before data collection.
Therefore I start with data collection!
Constructs, propositions and explanations are extracted from the data (normally QUALITATIVE).
118. Step-by-Step Graphical Guide (Inductive to Deductive)
INDUCTIVE: Sample Definition, Data Collection → Data Analysis and Operationalisation → Theory
DEDUCTIVE: Operationalisation → Sample Definition, Data Collection → Data Analysis / Hypothesis Testing → Refutation/Confirmation, Scope of Validity
119. Generating a Theory — Inductive to Deductive
An Example
• Field study in a company to investigate benefits and
challenges of the use of a UML-based method in a
large distributed development project
• Goal of the project: new safety-critical process-control
system based on several existing systems
• Four sites in three countries, 230 people, 100 using UML
• Data was collected through individual interviews,
questionnaires and project documents.
cf. Sjøberg et al., 2009 https://dx.doi.org/10.1007/978-1-84800-044-5_12
120. Generating a Theory: Example
• Step 1: Defining the constructs
• Interviews are performed to identify the most significant concepts to consider. The authors applied so-called "open coding" to the interview transcripts to identify the constructs
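Open coding can be pictured as tagging transcript fragments with emergent codes and tallying them; the fragments and code names below are invented for illustration, though the codes echo the constructs of the UML example.

```python
from collections import Counter

# Hypothetical interview fragments, each tagged by the analyst with
# one or more emergent codes (open coding).
coded_fragments = [
    ("The diagrams made design reviews much faster", ["communication", "design"]),
    ("We spent a lot of extra hours on modelling", ["costs"]),
    ("Test cases were easier to derive from the models", ["testability"]),
    ("Without training people misused the notation", ["training"]),
    ("Reviews went smoothly once everyone read UML", ["communication", "training"]),
]

# Tally the codes: the most frequent ones become candidate constructs.
code_frequency = Counter(code for _, codes in coded_fragments for code in codes)
candidate_constructs = [code for code, _ in code_frequency.most_common()]
print(candidate_constructs)
```

In a real study the coding is iterative and done by hand (or with qualitative-analysis tools); the tally is only the last, mechanical step.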
121. Generating a Theory: Example
Table 3 Constructs, propositions, example explanations and scope of the theory of UML-based development
Constructs
C1 UML-based development method
C2 Costs (total number of person hours in the project)
C3 Communication (ease of discussing solutions within development teams and in reviews)
C4 Design (perceived structural properties of the code)
C5 Documentation (the documentation of the system for the purpose of passing reviews as well as for expected future maintainability)
C6 Testability (more efficient development of test cases and better quality, i.e., better coverage)
C7 Training (training in the UML-based method before the start of the project)
C8 Coordination (of requirements and teams)
C9 Legacy code (code that has not been reverse engineered to UML-models)
122. Generating a Theory: Example
• Step 2: Defining the propositions
• From the interviews, relationships are identified between constructs (e.g., the relation between UML and cost), and these are translated into propositions
• The resulting propositions are confirmed with questionnaires
Propositions
P1 The use of a UML-based development method increases costs
P2 The use of a UML-based development method positively affects communication
P3 The use of a UML-based development method positively affects design
P4 The use of a UML-based development method positively affects documentation
P5 The use of a UML-based development method positively affects testability
P6 The positive effects of UML-based development are reduced if training is not sufficient and adapted
P7 The positive effects of UML-based development are reduced if there is insufficient coordination of modelling activities among distributed teams working on the same project
P8 The positive effects of UML-based development are reduced if the activity includes modification of legacy code
123. Generating a Theory: Example
• Step 3: Provide explanations
• Further analyse the interviews to understand the reasons behind the propositions
• Perform further interviews and check project documents to make sense of identified phenomena
Explanations
E4 The documentation is
– More complete
– More consistent due to traceability among models and between models and code
– More readable, and makes it easier to find specific information, due to a common format
– More understandable for non-technical people
– May be viewed from different perspectives due to different types of diagram
E5 Test cases based on UML models
– Are easier to develop
– Can be developed earlier
– Are more complete
– Have a more unified format
Moreover, traceability from requirements to code and test cases makes it easier to identify which test cases must be run after an update
124. Generating a Theory: Example
• Step 4: Identifying the scope of interest of the theory
• Technology: UML
• Actor: designers in distributed teams
• Software System: large, embedded software
• Activity: create and modify UML diagrams
125. Generating a Theory: Example
• Step 5: Testing the theory - Deductive Step
• Consider each proposition and perform a study for each one, or for a subset, e.g., "Use of UML methods increases cost", "Use of UML methods positively affects testability"
• I can use different methods to test the theory:
• Field studies: identify companies that are willing to introduce UML; establish a way to evaluate cost (e.g., person-hours); consider a comparable company not using UML; check the resulting cost.
• Experiment: two groups of subjects; give them a requirements document; ask group 1 to implement the code; ask group 2 to design and then implement; evaluate and compare cost.
• Survey/Questionnaire: contact multiple companies that have introduced UML and ask them to state their agreement with the propositions and the explanations
• Step 6: based on the selected study, I identify the scope of validity (larger for a survey, narrower for field studies)
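The survey option can be sketched as simple Likert aggregation. The two propositions below echo P1 and P5 from the example, but every response number is invented, and "agreement = share of 4s and 5s" is just one common convention.

```python
# Hypothetical 5-point Likert responses (1 = strongly disagree,
# 5 = strongly agree) from seven respondents per proposition.
responses = {
    "P1: UML increases costs": [5, 4, 4, 3, 5, 2, 4],
    "P5: UML positively affects testability": [4, 4, 5, 3, 2, 4, 5],
}

def agreement_rate(scores):
    # Share of respondents who agree or strongly agree (scores 4 and 5).
    return sum(1 for s in scores if s >= 4) / len(scores)

for proposition, scores in responses.items():
    print(proposition, round(agreement_rate(scores), 2))
```

A real survey would also report sample size, response rate, and how the sample maps onto the scope of interest, since those determine the scope of validity.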
126. Generating a Theory: Example
• Theory Evaluation
• Testability: constructs are not ambiguous and propositions are clear; furthermore, protocols are shared for replication. Since some subjective data collection was performed, replication may lead to different results.
• Empirical support: other studies seem to confirm part of the propositions
• Explanatory power: the motivations are derived from interviews, and not all factors may have been considered. Hence the explanatory power is limited (it does not account for all possible reasons WHY).
• Parsimony: reduced number of constructs and relationships in the propositions
• Generality: scope is narrow, as I have performed a case study
• Utility: utility is high, as it can help decision making
These are all logical arguments that have to be checked by peers!
127. The ABC of Software
Engineering Research
cf. Stol and Fitzgerald, 2018 https://doi.org/10.1145/3241743
128. The Need for a Taxonomy of Methods / Strategies
• As we said, there is no universally accepted taxonomy for
research methods in SE
The ABC of Software Engineering Research 11:3
Table 1. A "Mixed Bag": Alternative Research Methods in Software Engineering According to a Selection of Sources
• Glass et al. [63]: Action research, Conceptual analysis, Concept implementation, Case study, Data analysis, Discourse analysis, Ethnography, Field experiment, Field study, Grounded theory, Hermeneutics, Instrument development, Laboratory experiment (human/software), Literature review, Meta-analysis, Mathematical proof, Protocol analysis, Phenomenology, Simulation, Descriptive/expl. survey
• Zannier et al. [230]: Controlled experiment, Quasi experiment, Case study, Exploratory case study, Experience report, Meta-analysis, Example application, Survey, Discussion
• Sjøberg et al. [190]: Controlled experiment, Surveys, Case studies, Action research
• Höfer and Tichy [75]: Case study, Correlational study, Ethnography, Ex post facto study, Experiment, Meta-analysis, Phenomenology, Survey
• Easterbrook et al. [48]: Experimentation, Case study, Survey, Ethnography, Action research
Each author uses different terms to refer to research methods, and there is no agreement.
Let us talk about STRATEGIES, which can adopt specific METHODS…
129. A Unifying Framework: ABC of SE Research
• Actors: human and technical, i.e., managers, software engineers, users,
software systems, software development artifacts incl. defects, tools,
techniques, prototypes
• Behaviour: of all actors, i.e., system behavior (e.g., reliability, performance,
and other quality attributes), software engineers’ behavior and antecedents
such as productivity, motivation, and intention
• Context: of all actors, i.e., industrial settings, organizations, software
projects, development teams, software laboratory, classroom, meeting rooms
"Optimizing a study to achieve generalizability over actors (A) and precise
measurement of their behavior (B), in a realistic context (C), is impossible, and is
a “three-horned dilemma [since] there is no way—in principle—to maximize all
three (conflicting) desiderata of the research strategy domain” (McGrath,
1981 https://doi.org/10.1177/000276428102500205 )
Three main dimensions…
130. A Unifying Framework: ABC of SE Research
And two other dimensions…
• Obtrusiveness: to what extent does a researcher "intrude" on the research setting, or simply make observations in an unobtrusive way (i.e., how much control do I have over the empirical setting)
• Generalizability: to what extent the research findings are generalizable (i.e., how much of the scope of interest is covered, given the current scope of validity)
132. The ABC of Software Engineering Research 11:11
Fig. 1. The ABC framework: eight research strategies as categories of research methods for software engineering: Jungle, Natural Reserve, Flight Simulator, In Vitro Experiment, Courtroom, Referendum, Mathematical Model, Forecasting System
135. Field Studies (Jungle)
• Purpose: To investigate the impact of distributed teams
in software development
• Setting: Natural, first author spent 7 months on-site at an
organization.
• Procedure: Document study, observation, interviews.
• Findings: Four major problems and eight specific challenges
Example
136. Field Studies (Jungle)
• Setting: Natural setting that exists before the researcher enters it. Minimal intrusion of
the setting so as not to disturb realism, only to facilitate data collection.
• Purpose:
• Exploratory, to understand what’s going on, how things work, or to generate
hypotheses.
• Typical Methods and Data: Case study, ethnography, observational study; qualitative
data incl. interviews, field notes, archival documents, may include quantitative data.
• Inherent Limitations:
• No statistical generalizability
• No control over events
• Low precision of measurement
137. Field Studies (Jungle)
• Essence:
• Facilitates the study of real-world actors (people, systems) and their
behaviors in a natural setting that is not manipulated by the
researcher.
• High potential to capture realistic settings and a high degree of detail
of a particular system and context.
• Evaluation Considerations:
• Not suitable to investigate statistical relationships, or to otherwise
manipulate variables,
• Not suitable for findings that hold for larger populations.
138. Field Experiments
(Natural Reserve)
• Purpose: To identify a cost-effective way to avoid software
defects.
• Setting: Natural, company staff and researcher collaborated
on-site, using real products to evaluate new approaches.
• Procedure: Action research (improving case study, design
science), data include defect reports, time spent, usability
issues, timeliness of the project, product sales.
• Findings: certain techniques are beneficial, while others are time-consuming and do not avoid defects
139. Field Experiments
(Natural Reserve)
• Setting: Natural, pre-existing setting (in vivo), but some level of intrusion
due to the deliberate manipulation of aspects of the setting; study
affected by confounding factors.
• Purpose: To investigate, evaluate, or compare techniques, practices,
processes, or approaches within a real-world and pre-existing setting.
• Typical Methods and Data: case study, quasi-experiment, action
research; studies may use either quantitative data or qualitative data.
• Inherent Limitations:
• No statistical generalizability
• Precision of measurement affected by confounding contextual factors
140. • Essence:
• Facilitates the study of effects of a modification of properties of a
studied entity or phenomenon that occurs in a natural setting, i.e.,
pre-exists independently of the researcher.
• Potentially very costly to set up due to complexity of natural settings.
• Evaluation Considerations:
• Limited level of precision of measurement;
• Results not generalizable, but strongly linked to the specific setting
due to confounding variables that are very difficult to isolate.
Field Experiments
(Natural Reserve)
142. Experimental Simulations
(Flight Simulator)
• Purpose: To understand how developers perceive the testing
team
• Setting: Contrived, simulation environment with experimental
stimuli that were previously defined (e.g., the software to be
written by the developers, the types of checks performed by
testers)
• Procedure: developers develop code, testers test and give
feedback during a meeting, impressions of developers are
observed
• Findings: Insights into defensive reactions of the developers
143. Experimental Simulations
(Flight Simulator)
• Setting: Contrived setting (in virtuo) created specifically for a study to represent a
concrete type of setting. Environment is created by the researcher to study
behavior of actors.
• Purpose: To study behavior of participants or systems in a controlled setting that
resembles a real-world, concrete class of settings as closely as possible.
• Typical Methods and Data: Simulation/Role-playing games, management
games, instrumented multiplayer games; quantitative or qualitative data,
depending on the simulation instrument.
• Inherent Limitations:
• Generalizability reduced as the setting is designed to mirror a specific type of setting (e.g., I have specific subjects from a company)
• Realism reduced due to artificial setting
(Similar to a lab experiment, but more context-specific)
144. • Essence:
• A contrived setting that simulates a specific class of real-world systems
that to some extent resembles reality.
• Temporal flow of events depends on the simulation environment and
actors’ behavior, which allows for observing more natural behavior than a
laboratory experiment.
• Evaluation Considerations:
• Reduced level of realism compared to field experiments due to the
contrived setting
• Behavior of actors may reflect that in natural settings, but consequences
for actors lack realism, which may affect their behavior.
Experimental Simulations
(Flight Simulator)
145. Laboratory Experiments
(in Vitro Experiments)
• Purpose: To investigate the hypothesis that a certain
code inspection method A is more effective than another
method B
• Setting: Contrived, laboratory exercise with graduate
students
• Procedure: Measurement of effect of inspection methods
on 4 dependent variables including fault detection rate
• Findings: inspection method A is more effective than
inspection method B
146. Laboratory Experiments
(in Vitro Experiments)
• Setting: Contrived setting (in vitro) created specifically for a study, with high degree of
control of all measured variables.
• Purpose:
• to study with a high degree of precision relationships between variables, or comparisons
between techniques;
• may allow establishment of causality between variables.
• Typical Methods and Data: Randomized controlled experiments and quasi experiments,
comparative evaluations with benchmark studies; usually quantitative data exclusively.
• Inherent Limitations:
• Abstract or unrealistic context due to highly artificial setting
• Scope of problem reduced to study the “essence”, optimizing internal validity at cost of
external validity
147. • Essence:
• A controlled setting where behavior of actors (humans or systems) is
carefully measured through a number of discrete trials to establish
effects or conduct comparative analyses.
• Maximum potential to capture precise measurement of variables (high
internal validity) due to potential to isolate confounding factors.
• Evaluation Considerations:
• Studied relationships and variables are more abstract due to the
contrived and “sterile” nature of the research setting.
• The setting is more artificial than for experimental simulations
Laboratory Experiments
(in Vitro Experiments)
149. Judgment Studies
(Courtrooms)
• Purpose: To evaluate a set of 12 practices based on
feedback by team managers.
• Setting: Neutral, dedicated meeting room with seating
around a table.
• Procedure: 10 managers from 7 companies, selected
based on their interest and expertise.
• Findings: a framework of defects and benefits for the 12 practices
150. Judgment Studies
(Courtrooms)
• Setting: Neutral setting; may be actively designed to nullify the context, so that
“responses” are in relation to some stimulus (question or instructions), independent
of setting.
• Purpose: To elicit information from subjects for purposes of evaluation or study of
some entities.
• Typical Methods and Data: Delphi studies, interview studies, focus group,
evaluation studies; use of qualitative and/or quantitative data.
• Inherent Limitations:
• Responses not related to any specific or realistic context
• Less generalizability than sample studies due to lack of representative sampling
• Less control and precision of measurement than a lab. exp.
151. • Essence:
• Facilitates study of responses or behavior of actors that
bears no relation to the research setting, which is
neutral or actively “neutralized.”
• Allows for more complex questions and interactions
between researcher and respondents.
• Evaluation Considerations: No concrete or natural
setting, which prohibits capturing direct observations of
phenomena.
Judgment Studies
(Courtrooms)
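An expert panel's judgments are often summarised robustly, e.g. with medians as in Delphi studies. The sketch below assumes a hypothetical panel rating three made-up practices on a 1–5 scale; the practice names and scores are invented.

```python
import statistics

# Hypothetical expert-panel ratings (1 = harmful, 5 = very beneficial)
# for three of the evaluated practices, five experts each.
ratings = {
    "pair programming": [4, 5, 3, 4, 4],
    "daily builds": [5, 5, 4, 5, 3],
    "code ownership": [2, 3, 2, 4, 2],
}

# Delphi-style summary: the median is less sensitive to one extreme judge.
summary = {practice: statistics.median(scores)
           for practice, scores in ratings.items()}
print(summary)
```

A real Delphi study would feed these summaries back to the panel and iterate until the judgments stabilise; the median is only the per-round aggregation step.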
152. Sample Studies
(Referendum)
• Purpose: To investigate the state of practice of
requirements engineering in industry.
• Setting: Neutral, web-based questionnaire.
• Procedure: 22 questions; participants drawn from
internet; 194 responses from a population of 1,519
• Findings: Findings include organization and participant
characteristics (various domains; participants held variety
of positions); software development life cycle model
(agile, waterfall, etc.); RE techniques.
153. Sample Studies
(Referendum)
• Setting: Neutral setting. Limited level of precision of measurement; no variables
are manipulated. The researcher must deal with whatever data is collected.
• Purpose: To study the distribution of a particular characteristic in a population (of
people or systems), or the correlation between two or more characteristics in a
population.
• Typical Methods and Data: Software repository mining, surveys, questionnaires,
interviews; analysis includes correlational methods, e.g., regression. Typically,
quantitative data (e.g., Likert scales) but can include qualitative data.
• Inherent Limitations:
• Reductionist—depth of and number of data points per participant limited
• Data collection not “interactive”: no option to clarify questions; repository data
comes as is, no opportunity to manipulate variables, only to correlate them
154. • Essence:
• Facilitates data collection from a representative sample of a population (human
or nonhuman, such as systems or design artifacts).
• Maximum potential to generalize findings to a wider population;
• Unobtrusive research strategy.
• Evaluation Considerations:
• Questions tend to be “simple”;
• Limited opportunity for “complex” interaction between the researcher and
subjects.
• Research setting offers no realistic context.
Sample Studies
(Referendum)
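The correlational analysis typical of sample studies can be shown with a self-contained Pearson correlation. The project data below (team size vs. reported defect count) is invented purely to illustrate the computation.

```python
import math

# Hypothetical sample of six projects: team size and reported defect count.
team_size = [3, 5, 8, 12, 20, 30]
defects = [14, 22, 31, 45, 80, 118]

def pearson(xs, ys):
    # Pearson correlation coefficient, computed from first principles.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(team_size, defects)
print(round(r, 3))
```

As the slide notes, such a study can only correlate the characteristics, not manipulate them: a high `r` here says nothing about whether larger teams cause more defects.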
156. Formal Theory
(Mathematical Model)
• Purpose: To develop an understanding of the role of
creativity in RE.
• Setting: no empirical observations, but derivation of a
conceptual framework from literature.
• Procedure: check general creativity literature, check
requirements engineering creativity literature.
• Findings: A theoretical framework that offers RE
researchers a basis to incorporate creativity in RE
methods and techniques.
157. Formal Theory
(Mathematical Model)
• Setting: Nonempirical setting; typically a research office or library.
• Purpose:
• To develop a conceptualization, framework, or theory on a topic.
• Focus is on formulating relations among concepts, or explanations that hold for a
wide range of contexts.
• Typical Methods and Data: literature reviews, conceptual reasoning, concept development, development of propositions and/or hypotheses; framework development.
• Inherent Limitations:
• Low on realism: does not consider a specific context but rather abstract concepts
• No manipulation of variables or measurement (no empirical information is gathered)
158. • Essence:
• The careful and justified construction of a theoretical model that
represents one view of a phenomenon, which helps to analyze
or explain the real world.
• Models generic behavior for a range of classes of populations (humans or nonhuman artifacts), which serves to make predictions or explanations about the real world.
• Evaluation Considerations:
• Theoretical models do not generate new empirical observations,
though may inform future empirical studies.
Formal Theory
(Mathematical Model)
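Formulating relations among concepts can be made concrete by encoding propositions as typed edges between constructs. The sketch below reuses labels from the earlier UML example; the relation names ("increases", "positively affects", "moderates") are my simplification of the propositions, not the authors' formalisation.

```python
# A conceptual framework as a small typed graph: each proposition links a
# source construct to a target construct via a relation label.
propositions = {
    "P1": ("UML-based development", "increases", "Costs"),
    "P2": ("UML-based development", "positively affects", "Communication"),
    "P5": ("UML-based development", "positively affects", "Testability"),
    "P6": ("Training", "moderates", "positive effects of UML"),
}

def affected_constructs(props, source):
    # All constructs that a given construct relates to, per the propositions.
    return sorted(target for s, _, target in props.values() if s == source)

print(affected_constructs(propositions, "UML-based development"))
```

Writing a theory down this explicitly is useful precisely because it exposes what is testable: each edge is a candidate hypothesis for a later deductive study.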
159. Computer Simulations
(Forecasting system)
• Purpose: To investigate bottlenecks and overload in the
testing processes.
• Setting: Nonempirical, a discrete event simulator was
implemented.
• Procedure: Four simulation scenarios with different
parameter values to model different circumstances.
• Findings: Two ways were identified to avoid congestion:
(1) increase number of staff, (2) increase the number of
interactions with the development team.
160. Computer Simulations
(Forecasting system)
• Setting: Nonempirical setting (in silico); no recording of observations in the real world.
There are no actors (people, real-world systems) or real-world behavior: everything is
specified in the simulation.
• Purpose: To model a particular system or phenomenon that facilitates evaluation of a
large number of complex scenarios that are captured in the preprogrammed model.
• Typical Methods and Data: Development of software programs that contain symbolic
representations of all variables a researcher considers important; usually these variables
are derived and calibrated based on prior empirical studies.
• Inherent Limitations:
• No empirical data is gathered
• Results will be as good as the accuracy of the model representing the simulated
system
• Low generalizability as it attempts to model a specific class of real-world systems
161. • Essence:
• Represents a symbolic replica of a concrete real-world system
where all configurations and variables are preprogrammed.
• Useful to run a large number of complex scenarios to explore a
solution space, which might not be feasible to do manually.
• Evaluation Considerations:
• All simulation rules are preprogrammed: no new empirical (i.e.,
real world, as opposed to simulated) behavior is observed.
• Due to concrete implementation, limited generalizability.
Computer Simulations
(Forecasting system)
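The bottleneck study above can be mimicked with a toy discrete-event simulation. Every parameter below (arrival probability, service time, tick count) is an assumption chosen for illustration, not calibrated against empirical data, which is exactly the inherent limitation the slide names.

```python
import random

def simulate(testers, ticks=1000, arrival_prob=0.3, service_ticks=5, seed=1):
    # Toy in-silico model of a testing queue: builds arrive at random,
    # each idle tester picks one up and is busy for service_ticks.
    random.seed(seed)
    queue = 0
    busy = [0] * testers      # remaining service time per tester
    max_queue = 0
    for _ in range(ticks):
        if random.random() < arrival_prob:   # a new build arrives for testing
            queue += 1
        for i in range(testers):
            if busy[i] > 0:
                busy[i] -= 1
            elif queue > 0:                  # idle tester picks up next build
                queue -= 1
                busy[i] = service_ticks
        max_queue = max(max_queue, queue)
    return max_queue

# Scenario comparison: with one tester demand exceeds capacity and the
# queue grows; with three testers the bottleneck disappears.
congested = simulate(testers=1)
relieved = simulate(testers=3)
print(congested, relieved)
```

The simulation's conclusions are only as good as the model: changing `arrival_prob` or `service_ticks` can flip the result, which is why such parameters are normally derived from prior empirical studies.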
162. The ABC of Software Engineering Research 11:11
Fig. 1. The ABC framework: eight research strategies as categories of research methods for software engineering (Jungle, Natural Reserve, Flight Simulator, In Vitro Experiment, Courtroom, Referendum, Mathematical Model, Forecasting System), annotated with which strategies are frequent and which are rare in SE
164. 7 Commandments of the Empirical
Software Engineer (From Daniel Méndez)
1. No such thing as absolute and / or universal truth (truth is always
relative)
2. The value of scientific theories always depends on their
• ability to stand criticism by the (research) community,
• robustness / our confidence (e.g. degree of corroboration),
• contribution to the body of knowledge (relation to existing evidence)
• ability to solve a problem
3. Theory building is a long endeavour where
• progress comes in an iterative, step-wise manner,
• empirical inquiries need to consider many non-trivial factors,
• we often need to rely on pragmatism and creativity
• we depend on acceptance by peers (research communities)
https://www.slideshare.net/mendezfe/an-introduction-into-philosophy-of-science-for-software-engineers
165. 4. Be sceptical and open at the same time
• no statement imposed by authorities shall be immune to criticism
• be open to existing evidence and arguments/explanations by others
5. Be always aware of
• strengths & limitations of single research methods
• validity and scope of observations and related theories
• relation to existing body of knowledge / existing evidence
6. Appreciate the value of
• all research processes and methods
• null results (one’s failure can be another one’s success)
• replication studies (progress comes via repetitive steps)
7. Be an active part of something bigger (knowledge is built by
communities)