Guidelines for using empirical studies in software engineering education

Fabian Fagerholm (1,*), Marco Kuhrmann (2,*), and Jürgen Münch (1,3,*)

1 Department of Computer Science, University of Helsinki, Helsinki, Finland
2 Institute for Applied Software Systems Engineering, Clausthal University of Technology, Goslar, Germany
3 Herman Hollerith Center (HHZ), Reutlingen University, Böblingen, Germany
* These authors contributed equally to this work.

Submitted 12 November 2015. Accepted 7 August 2017. Published 4 September 2017.
Corresponding author: Fabian Fagerholm, fabian.fagerholm@helsinki.fi
Academic editor: Perdita Stevens
DOI 10.7717/peerj-cs.131
Copyright 2017 Fagerholm et al. Distributed under Creative Commons CC-BY 4.0. Open access.

ABSTRACT

Software engineering education is under constant pressure to provide students with industry-relevant knowledge and skills. Educators must address issues beyond exercises and theories that can be directly rehearsed in small settings. Industry training faces similar relevance requirements as companies seek to keep their workforce up to date with technological advances. Real-life software development often deals with large, software-intensive systems and is influenced by the complex effects of teamwork and distributed software development, which are hard to demonstrate in an educational environment. One way to experience such effects and to increase the relevance of software engineering education is to apply empirical studies in teaching. In this paper, we show how different types of empirical studies can be used for educational purposes in software engineering. We give examples illustrating how to utilize empirical studies, discuss challenges, and derive an initial guideline that supports teachers in including empirical studies in software engineering courses. Furthermore, we give examples that show how empirical studies contribute to high-quality learning outcomes, to student motivation, and to awareness of the advantages of applying software engineering principles. Having awareness, experience, and understanding of the actions required, students are more likely to apply such principles under real-life constraints in their working life.

Subjects: Computer Education, Software Engineering
Keywords: Software Engineering Education, Computer Science Curricula, Teaching Methods, Empirical Studies, Experimentation, Education, Guideline

INTRODUCTION

Providing relevant knowledge and skills is a continuous concern in software engineering education. Students must be exposed to realistic settings to understand why applying fundamental software engineering principles is necessary, why decisions should be grounded in evidence, and to learn to foresee the long-term and delayed effects of certain behaviour or decisions in software projects. Using empirical instruments is one approach to teaching relevant software engineering knowledge and skills. The goal of this paper is to use our teaching experiences to develop practice-grounded guidelines that help teachers include empirical instruments in their teaching.
Since real-life software development routinely deals with large, software-intensive systems and is influenced by the manifold and complex effects of teamwork and distributed software development, software engineering education must enable students to understand such environments and to apply knowledge properly and effectively. However, restrictions in the academic curriculum and the complexity and criticality of real software products limit the level of realism that can be achieved in education. As problems are narrowed down to be manageable, practical relevance is lost through scope and problem size limitations and the use of artificial settings instead of real-world problems. Many effects only become visible over long time periods, e.g., the efficiency of a particular method or the eventual impact of a design decision. Often, time is too limited to provide adequate means to experience such effects in a single course.

The same problem occurs in practitioner training. Industry must quickly develop solutions and services in order to deliver customer value and, eventually, survive in market competition. Empirical evidence may not be easily available, and practitioners may resort to decision-making based on biased individual beliefs, negatively affecting the productivity of development teams.

Over the years, we have implemented empirical instruments in software engineering courses to (1) provide an environment in which students can experience real-life problems while increasing their motivation and the quality of learning outcomes, (2) pave the way for conducting research in collaboration with industry, and (3) apply these instruments in industry for training purposes. In our experience, this approach has been well suited to preparing students for working life. Training in empirical instruments, such as experimentation or case study research, and direct experience of the value they provide, has encouraged our students to apply their acquired knowledge in practice, and to explore problems, new methods, and new tools in a systematic and evidence-based manner. Since empirical instruments are well accepted for conducting (applied) research in industry, and since students can form their own experiences when doing empirical studies, we claim that utilizing empirical instruments in teaching increases the quality as well as the practical relevance of SE education.

We show how different instruments, ranging from controlled experiments to qualitative studies, can be used for teaching purposes. We also consider overarching approaches that situate empirical studies in a larger context. We systematize the purposes and challenges of the different study types, discuss the validity of the results that can be obtained in a teaching context, and create a link between teaching and research. We use a selection of representative studies to discuss the impact on teaching as well as on industry relevance. The guidelines developed in this paper provide a systematic collection of purposes, learning goals, challenges, and validity constraints, and aim to support teachers in selecting proper study types for inclusion in their courses.

The remainder of this paper is structured as follows. 'Related Work' reviews related work on empirical studies as a teaching instrument. 'Research Approach' describes the research approach taken in this paper.
'An Overview of Empirical Study Types for Software Engineering Education' discusses empirical instruments in SE education and provides an initial taxonomy of them. 'A Guideline for Integrating Empirical Studies with Software Engineering Courses' presents the main contribution of the paper: a guideline for integrating empirical studies with software engineering courses. 'Experiences' gives examples of implementing empirical instruments in university education and discusses the implications of integrating them into courses. 'Conclusion' provides conclusions and lists possible future work.

RELATED WORK

Experiments and other types of empirical studies are key to the scientific approach. Empirical studies are performed in research for many different purposes, such as understanding real-world phenomena, testing hypotheses, and validating theories (Wohlin et al., 2012; Runeson et al., 2012; Kitchenham, Budgen & Brereton, 2015). However, empirical studies, especially experiments, are well established in only a few disciplines, and their use as teaching tools is less common than, for example, classroom exercises and lectures. In many areas, such as software engineering, using empirical studies in teaching is still uncommon compared to learning tasks that involve reading or writing about existing research, and individual exercises that focus mainly on small-scale technical implementation.

Empirical studies in teaching in other disciplines

Physics education may be a prime example where, due to the historical development of the discipline, experiments have a central pedagogical role. Beyond their function as means for verifying or refuting theories, experiments in physics have a generative function with relevance for education and learning (Koponen & Mäntylä, 2006). While the types of problems in physics and software engineering differ, experiments play a similar role for learning in both.

Another discipline, also different in nature from physics, that already has a high level of maturity in using experiments for teaching purposes is economics. Experiments became widespread teaching tools in economics in the 1990s (Parker, 2014). Nowadays, many economists use experiments as educational tools. Parker (2014) mentions several benefits of using experiments: they are distinctive and more participative, and in consequence, students are likely to remember lessons associated with them. Parker also mentions that the experiential component in experiments can be very important and that students and instructors usually think that experiments are fun.

Experiments can be used as part of many educational approaches. For example, experiments could be used in different ways with problem-based learning (PBL) (Barrows & Tamblyn, 1980; Wood, 2003). In PBL, students define their own learning objectives connected to a problem scenario, while the tutor ensures that the objectives are "focused, achievable, comprehensive, and appropriate" (Wood, 2003). One possibility is to have the tutor guide students towards objectives that involve different degrees of experimentation, e.g., formulating research questions, defining research designs, or even carrying out studies with real or simulated data. The tutor may provide data as part of the problem scenario; it may be part of the trigger material provided to students.
Experiments can also be used as part of project-based learning (Blumenfeld et al., 1991), where students actively explore real-world challenges and problems. Instructors can introduce experiments when important decision-making and knowledge acquisition needs emerge.

In order to support the design of constructive education with experiments embedded, as well as to support experimentation within more traditional teaching, teachers would benefit from guidelines or sets of ready-made experiment templates that they could use either when planning or dynamically during teaching. The SERC Portal for Pedagogy in Action, created by Ball et al. (2012), provides a repository of classroom experiments. Ball et al. define such experiments as "activities where any number of students work in groups on carefully designed guided inquiry questions". Students collect data through interaction with typical laboratory materials, data simulation tools, or a decision-making environment, as well as "a series of questions that lead to discovery-based learning." The repository includes a comprehensive list of experiments from different disciplines that can be used for replication in classroom settings. In addition, it contains references to scientific studies that provide empirical evidence about the expected positive effects of experiments as teaching tools. An example is an empirical investigation of the impact of classroom experiments on the learning of economics (Frank, 1997). Several of the cited studies show higher academic achievement (e.g., measured as an increase in students' homework scores) when using experiments compared to control classes where standard lectures are used. Ball et al. (2012) also cite studies that show improved student satisfaction with teaching pedagogy when using experiments. The repository also contains guidelines for designing and conducting experiments as part of teaching. The guidelines include important aspects such as strategies for handling unexpected outcomes of experiments.

Requirements for applying empirical studies in teaching

The discussion regarding the suitability of experiments mainly focuses on criteria that need to be fulfilled for designing successful experiments, the balance between practical work and theory, and the suitability of students as experimental subjects. Parker (2014) mentions three basic criteria: (1) the experiment must be aligned with the central topic of the course, (2) the concept to be taught through the experiment should not be easily understood without the experiment or already be obvious, and (3) students need to be able to quickly learn the necessary prerequisites for participating in the experiment. Dillon (2008) provides an overview of advantages and disadvantages of experiments based on empirical findings. An important conclusion drawn from the overview is that successful observation of a phenomenon as part of an empirical study should not be an end in itself. Rather, students should have enough time to become familiar with the ideas and concepts associated with the phenomenon.

Empirical studies in software engineering education

In software engineering, experimentation was established in the 1980s. Basili, Selby & Hutchens (1986) were among the first to present a framework and process for experimentation. Since then, software engineering experiments in classroom settings have become more common.
However, the focus of most such experiments has been to gain research knowledge, with students participating as research subjects. Less attention has been paid to using empirical studies with an educational purpose in mind, where the experiment has an explicit didactic or experiential role. Few curricula are available that include the execution of empirical studies as an integral part of a lecture (e.g., Kuhrmann, Fernández & Münch, 2013; Hayes, 2002).

The use of students as experimental subjects has often been discussed in the literature. In software engineering, the topic has mainly been analysed to understand the suitability of students as subjects compared to professional practitioners. An example of such an investigation has been presented by Runeson (2003). Carver et al. (2003) note that while it is common to carry out empirical studies in software engineering with students as subjects, the educational value of the studies is often overlooked. At the same time, solving the pedagogical challenges involved is not straightforward. Carver et al. discuss costs and benefits for researchers, students, instructors, and industry, and provide a check-list with advice for carrying out empirical studies with student subjects. The same authors have later extended their check-list with requirements for successful empirical studies with students, based on previous literature (Carver et al., 2010). The check-list includes items addressing considerations before a class begins, as soon as it begins, when the study begins, and when the study is completed. The authors emphasise integration of the study with the course topic and schedule, documentation, and considerations of study validity.

Only a few studies have investigated the impact of empirical studies in the curriculum on learning. We expect that the effect is generally positive as long as the integration is carried out properly. Staron (2007) finds that students' learning process is improved and that including carefully designed experiments in software engineering courses increases their motivation. A large majority (91%) of students who participated as subjects in the experiments found them useful, and the number of high-passes increased by 41% after introducing experiments.

While many articles report on empirical studies using student subjects, and some articles report on the educational benefits of such studies for students, few papers address empirical studies as an overall strategy for software engineering education. In particular, there is a lack of guidance for using empirical studies in software engineering education in cases where students may not only be research subjects but could also be involved in carrying out the studies. An overview that discusses different types of empirical studies, their suitability for education, and challenges with respect to their execution is missing.

RESEARCH APPROACH

The goal of this paper is to develop guidelines that help teachers integrate empirical instruments in software engineering education. The guidelines are based on a reflective analysis of our experiences with teaching courses that use empirical elements to support learning objectives. A reflective approach has been recognised by many educational researchers as a prerequisite for effective teaching (e.g., Hatton & Smith, 1995; Cochran-Smith, 2003; Jones & Jones, 2013).
Reflective practice, with roots in the works of Dewey (1935) and Schön (1983), calls for continuous learning through deliberate reflection in and on action. Using empirical instruments in software engineering education is a way to encourage students to reflect, but teachers should do the same. This paper represents one outcome of reflection-on-action: we analyse materials, assignments, notes, course syllabi, schedules and structures, evaluation data, and recollections of important factors in a number of our own courses, and derive guidelines that we believe will help teachers implement similar courses.

Our approach is mainly qualitative and proceeded from gathering a list of study types, through analysis of materials and experiences relevant to each study type, to the guideline proposed in this paper. Here, analysis refers to the categorisation of materials and the identification of connections and relationships between categories. Our main goal of developing the guideline helped to scope our investigation, and we thus left out material which did not serve this goal. We began by sifting through parts of the published literature on software engineering education and methods in order to shape a first outline of a taxonomy of study types. In particular, we were influenced by Höst (2002) and Carver et al. (2003) when considering software engineering education, and by Shull, Singer & Sjøberg (2008) and Kitchenham, Budgen & Brereton (2015) when considering the methodological aspects. Our search was purposive rather than systematic, as we sought to construct a taxonomy (see 'An Overview of Empirical Study Types for Software Engineering Education') for use in the guidelines rather than to represent the state of the art in the scientific literature.

After constructing the taxonomy, we analysed qualitative data from our own courses and arranged it according to five categories: (1) learning goals, purposes, challenges, and validity; (2) establishing context and goals, and determining a study type; (3) motivating students; (4) scheduling; and (5) other considerations. We summarised the qualitative data in each category by removing the details specific to our courses and generalising the insights so that they can be applied more broadly. We then constructed the guideline by cross-referencing the categories so that the purposes, challenges, and validity concerns relevant to each study type are shown. The result is given in 'A Guideline for Integrating Empirical Studies with Software Engineering Courses'. Finally, we revisited the material from our courses and picked examples that illustrate how we tackled some of the choices teachers face when using empirical instruments for education. We also addressed the specific question of evaluating our teaching by providing data from formal as well as informal evaluation (see 'Experiences'). This serves as a first validation of the guidelines.

AN OVERVIEW OF EMPIRICAL STUDY TYPES FOR SOFTWARE ENGINEERING EDUCATION

The software engineering literature includes a number of empirical studies with students, and often these studies were conducted in an educational setting. In this section, we give an overview of (empirical) study types utilised in software engineering education. We list common instruments from empirical software engineering and provide examples of how these instruments can be applied to teaching.
The overall goal of this section is to summarise different study types that can be used in software engineering education. The summary supports the development of an initial common taxonomy that categorises study types. The taxonomy helps to determine the appropriateness of a particular instrument in a specific setting. A concise overview of the study types, including a brief description and an outline of the potential positive educational aspects, is given in Table 1.

Table 1: Summary of empirical study types.

Case study
  Description: Investigate a phenomenon in its natural context. Especially suitable for exploratory and explanatory designs. Results grounded in context.
  Potential for education: Gaining observational and analytic skills. Observing real scenarios with real objectives and constraints. Knowledge of high relevance for professional use.

Formal and semi-formal experiment
  Description: Investigate the effect of a treatment under controlled conditions. Rigorous design requirements. Results constitute tests of a theory.
  Potential for education: Demonstrate the real impact of theory. Gain skills to formulate and test a theory.

Continuous experimentation
  Description: Constant series of experiments to test value creation, delivery, and capture of software or software-based products. Results of experiments can be used to make design decisions.
  Potential for education: Understand the connection between software development and the business and customer domain. Gain skills to test product assumptions to provide evidence for product decisions.

Software process simulation
  Description: Simulation model used as an abstraction of a real process. Cost and time advantages can be obtained. Requires a valid model.
  Potential for education: Gain understanding of process dynamics and complexity with limited resources. Experience effects of decisions.

Individual studies
  Description: E.g., Bachelor's or Master's theses. Focused work on a specific problem for a limited time. Various studies are possible.
  Potential for education: Learn to conduct a study in a self-organised manner. Gain domain knowledge.

Further instruments
  Description: Augment or provide context for the aforementioned study types, e.g., replication studies, OSS projects.
  Potential for education: Provide ways to enhance other study types.

Case studies

Case studies aim to investigate a phenomenon in its natural context. When utilised for educational purposes, case studies can omit some aspects of a full research design (Yin, 2009), but can borrow from design science methodology (Hevner et al., 2004), where an artefact is designed, implemented, and evaluated in order to learn. When performing case studies with industry, the context is provided by business objectives and realistic constraints (Brügge, Krusche & Alperowitz, 2015). Industry aims to develop results which contribute to solving a problem with relevance in their settings. For instance, developers can be trained in a close-to-reality environment, aiding the understanding of situations that will occur in the (near) future. On the other hand, researchers perform case studies to understand and capture phenomena in their natural context. Depending on the rigorousness of the study design, both practitioners and researchers benefit from a case study due to its grounding in realistic settings. Apart from "normal" case study research, teachers can use the case study instrument to help motivate students by providing problems with visible real-life applications.
Case studies also help teachers to transmit procedural knowledge, as students are required to formulate problems, design solutions, and evaluate them. Case studies help to answer explanatory questions of the type "How?" or "Why?" They should be based on an articulated theory regarding the phenomenon of interest (Yin, 2009). A case study can then provide additional evidence for a theory, help to modify or refine a theory, or suggest an alternative theory that better fits the observations. Furthermore, a case study can also be utilized to discover (new) interesting and relevant issues.

Case studies can be implemented in different ways. They can be categorized as single- or multiple-case, holistic or embedded (Wohlin et al., 2012; Runeson et al., 2012; Yin, 2009), or as intrinsic (Stake, 1995; Baxter & Jack, 2008). They can be deductive or inductive, exploratory or confirmatory, and they can make use of both quantitative and qualitative data (Yin, 2009; Eisenhardt, 1989). In the context of teaching, the normal case study setup is a holistic single-case study in which a single instance of the unit of analysis (the case) is examined. More complex designs, e.g., multiple-case studies, increase the value of the study results for research. However, these aspects can be considered less important for teaching. Furthermore, setting up a case study, even in teaching, requires an environment in which the phenomenon of interest occurs naturally.

Case studies are a valuable source for generating diverse results. From the industry point of view, case studies help to elaborate and understand the value of reaching the case objective, ranging from increased technological understanding to increased understanding of the customer value of a product or service. They help uncover real technology- and knowledge-related challenges involved in reaching the objective and, thus, provide information on the cost, effort, and risks involved in the case. From the perspective of researchers, case studies can contribute to the development of general but context-bound technological rules, including case-specific insights and lessons learnt. Given replication with multiple cases, the rules can also reach the level of more general theory. Moreover, case studies can be fruitful grounds for exploration and help to discover or identify research questions.

From the teaching perspective, case studies serve several purposes, of which gaining observational and analytic skills are the most important. Case studies help participants to get insights into a setting in which a particular phenomenon occurs. Analysing problems and deriving tasks to solve them thus happens in real scenarios rather than in synthetic situations. Solutions can be evaluated against real objectives and constraints. Consequently, this kind of learning produces knowledge of higher relevance for professional use, and teaching directly addresses subject matter and procedural knowledge related to a specific problem type.

Examples of case studies in teaching

Fagerholm, Oza & Münch (2013) describe the Software Factory, which is an instrument to combine software development education and training with conducting empirical research. A fruitful ground for this kind of teaching is global software development, as demonstrated by, e.g., Oza et al. (2013), Richardson, Milewski & Mullick (2006), and Deiters et al. (2011).
In the Software Factory environment, students work with a company on a real software development project, providing a level of realism that is not available in a regular course exercise. This realism, along with the opportunity to work in a team setting, provides the potential to conduct case studies with educational relevance.

Formal and semi-formal experiments

According to Wohlin et al. (2012), an experiment (controlled experiment) is defined as "an empirical inquiry that manipulates one factor or variable of the studied setting." Different treatments are applied, or treatments are assigned to different subjects, to measure effects on variables of interest. If treatment is not randomly assigned, we speak of a "quasi-experiment." Experiments aim to investigate settings in which the environment is under control, and effects of interest are investigated by manipulating variables. For instance, if the efficiency of a particular method is subject to investigation, one experiment group is assigned to solve a problem with the "new" method, while another group works on the same task but uses another method. Results are then compared, e.g., to accept or reject a hypothesis. Thus, experiments can be utilised to test theories and conventional wisdom, to explore relationships, to evaluate the accuracy of models, and to validate measures. Importantly, experiments should always provide a detailed context description, showing the settings in which certain claims are true, and in which certain techniques or tools are beneficial.

Experiments require rigorous design. Wohlin et al. (2012) present an experiment process which consists of scoping, planning, operation, analysis and interpretation, and presentation and packaging. However, providing a general experiment design is demanding, as the design depends on the respective subject and context. Apart from the general experiment process, several smaller guidelines exist to direct researchers through the process, e.g., the goal template from the TAME project (Basili & Rombach, 1988); experimentation packages providing reusable designs, templates, and so forth (e.g., we created such a template in the context of self-organizing project teams (Kuhrmann & Münch, 2016b); another example can be found in Fucci, Turhan & Oivo (2015)); and, pragmatically, case study designs, which can be derived from, e.g., Runeson & Höst (2009), who provide advice on planning and a guideline on reporting case study research.

Experiments involve several trade-offs for the conducting parties, and their usefulness depends on the respective context. For instance, while the importance of experiments in research is not questioned, experimentation in industry has to be considered in terms of business value (e.g., by providing new, efficient methods or creating and evaluating software prototypes, paving the way for new products). Requirements regarding the validity of the results differ, as does the general scope of experiments. Furthermore, as we have previously discussed, small and very small companies usually lack the resources for the necessary preparation and staffing (Kuhrmann, 2015). Nevertheless, experimentation allows for, e.g., evaluating different methods and tools, building a hypothesis, and testing the hypothesis.
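To illustrate the analysis step of such a comparison, the following minimal sketch (our own illustration, not taken from any of the cited studies; the group labels, sample data, and the choice of a Mann-Whitney U test are assumptions made for demonstration) compares the task-completion times of two student groups, each using a different method, and tests a hypothesis at a pre-chosen significance level.

    # Minimal sketch: comparing two experiment groups on task-completion time.
    # Hypothetical data; in a real course, the times come from the study itself.
    from scipy.stats import mannwhitneyu

    # Task-completion times in minutes for two groups using different methods.
    group_new_method = [38, 42, 35, 47, 40, 36, 44, 39]  # treatment group
    group_old_method = [51, 46, 49, 55, 43, 52, 48, 50]  # control group

    # Non-parametric test: no normality assumption, suitable for small samples.
    statistic, p_value = mannwhitneyu(group_new_method, group_old_method,
                                      alternative="two-sided")

    alpha = 0.05  # significance level fixed before the experiment
    print(f"U = {statistic}, p = {p_value:.4f}")
    if p_value < alpha:
        print("Reject H0: the groups' completion times differ.")
    else:
        print("Cannot reject H0: no significant difference observed.")

In a course, students can vary the test (e.g., a t-test when normality holds) and discuss why the significance level must be fixed before the data are seen.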
Experiments also help to confirm conventional wisdom, e.g.: "Everybody says that Follow-the-Sun development is fast but expensive. Is this also true for our situation?"

Examples of experimentation in teaching

For teaching, experiments can be a valuable source of knowledge and experience. For instance, experiments can be used to elaborate the real impact of theoretical concepts: in Kuhrmann & Münch (2016b), the theoretical concept taught was the well-known Tuckman model (Tuckman, 1965), and in a complementing experiment, students could experience the effects of group dynamics themselves, e.g., changing team set-ups or external influences. Another experiment, reported in Kuhrmann & Münch (2016a), creates a setting in which students can experience the crucial role of communication (and absent communication) in distributed development set-ups.

In Kuhrmann, Fernandez & Knapp (2013), we present a controlled experiment on the perception of software process modelling paradigms. A German NGO sponsored a process description, which the students had to analyse and improve according to a given approach (Kuhrmann, Fernández & Münch, 2013). Students used two different process development environments, each implementing a different modelling paradigm. They went through the process life cycle, learned about analysis, design, and realisation tasks, and conducted result assessments. Furthermore, the experiment outcomes showed advantages and disadvantages of the particular modelling paradigms.

Continuous experimentation

Continuous experimentation refers to a constant series of experiments that test the value of software capabilities, such as features, early in the design process (Fagerholm et al., 2014a; Fagerholm et al., 2017). The major driver is the industrial need to better understand product value delivery, so that development activities can be focused on delivering only capabilities that create value for users or customers. Our experience shows that it is a mistake to ignore the value aspect in SE education, as it is a critical part of understanding software requirements. This is especially relevant in complex domains where requirements are unknown and cannot be elicited up front.

In contrast to empirical software engineering, which usually focuses on technical product or process aspects from a developer perspective, the purpose of continuous experimentation is to validate the assumptions underlying a business model or a product roadmap. The perspective is usually that of a product owner or entrepreneur. Continuous experimentation is a means to evolve business models, product roadmaps, or feature scopes based on validated assumptions. It is based on approaches such as Lean Startup (Ries, 2011) and Customer Development (Blank, 2006), and pushes product managers and developers to connect with real users (e.g., through interviews or by analysing usage data) in order to test critical assumptions and make evidence-based product decisions. This typically requires, e.g., the execution of experiments in a scientific style and the implementation of feedback channels that allow observing user behaviour. For teachers, it is a great challenge to instruct students accordingly, as classic software engineering teaching is usually separated from value considerations.
Thus, creating a mind-set in which value creation is the baseline for all development tasks changes the way software development is performed, in the sense that decision-making is based on continuously obtained evidence about customer value.

In order to conduct continuous experimentation, different designs can be applied depending on the hypotheses or study goals under investigation. A typical design is a case study consisting of a sequence of build-measure-learn cycles that develop a so-called minimum viable product (MVP). Simply speaking, an MVP is a prototype that allows for testing with potential customers. Such testing requires a design that can quickly obtain customer feedback during the study. In consequence, access to potential customers is needed to conduct such studies.

From the industry perspective, continuous experimentation results in knowledge that supports or refutes assumptions about product value. An MVP might result from an experiment. Such a result might consist of a working prototype as well as data or lessons learnt about the customer value of the prototype, its development process, and potentially about relevant customer segments. The study might also contribute to testing other critical assumptions of a business model, such as assumptions about customer relationships or channels. For researchers, continuous experimentation helps to better understand processes, methods, techniques, tools, and organizational constraints regarding building the "right" software.

From the teaching perspective, continuous experimentation helps students understand the connection between software development techniques and business. Since such experiments must begin by analysing product-related assumptions, students naturally come into contact with the product's business model. They must then make the link between such assumptions and the corresponding technical implementation and devise an experiment which allows them to refute or support the highest-priority assumption, yielding evidence for a product-related decision. Continuous experimentation can thus foster awareness of relevant criteria for software beyond cost, reliability, and effort, e.g., usability, usefulness, success (e.g., contribution to a higher-level organizational goal), and scalability (e.g., monetization from a significant number of users).

Examples of continuous experimentation in teaching

Fagerholm et al. (2014a) present building blocks for continuous experimentation and describe the execution of three student projects that aimed at conducting build-measure-learn loops. These projects were performed in cooperation with a start-up and sought to understand aspects such as future development options or scalability issues of a new service. The projects helped to evolve the product roadmap and led to several technical pivots where previous assumptions were invalidated and new options found. Students gained significant insights into the connections between technical and business considerations. A process model and an infrastructure architecture model for continuous experimentation are described in Fagerholm et al. (2017).

Kohavi et al. (2012) describe continuous experiments from an industry perspective. The authors present a system for constant experimentation at Microsoft. They emphasize that learning addresses many aspects beyond understanding experimentation techniques. For instance, it is necessary to learn how to identify and understand the reasons for experiments in an organization. In addition, learning needs to address a change of the company culture towards experimentation.
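To make the measure-and-learn side of a build-measure-learn cycle concrete, the following minimal sketch (our own illustration, not taken from the cited studies; the variants, usage counts, and the choice of a chi-squared test are assumptions) evaluates a simple A/B test of two MVP variants, the kind of evidence-gathering step students would perform before a product decision.

    # Minimal sketch: evaluating an A/B test of two MVP variants.
    # Hypothetical usage counts; in practice these come from the feedback
    # channels (instrumented usage data) described above.
    from scipy.stats import chi2_contingency

    # Rows: variant A, variant B. Columns: users who activated the feature,
    # users who did not.
    observed = [
        [120, 880],   # variant A: 12.0% activation
        [165, 835],   # variant B: 16.5% activation
    ]

    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
    if p_value < 0.05:
        print("Evidence that the variants differ: the assumption behind the")
        print("better variant is supported and can inform the product decision.")
    else:
        print("No significant difference: the assumption remains untested.")

Students can then connect the outcome back to the business assumption the variant was meant to test, closing the build-measure-learn loop.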
Software process simulation

Experimentation is a costly way to learn. It requires, for instance, significant preparation of experimental materials and treatments. Software process simulation refers to the use of a simulation model as an abstraction of a real process. Typical purposes for using such models are experimentation, increased understanding, prediction, decision support, or education about a process. Assuming that a valid model exists, process simulation promises advantages with respect to cost. Since part of the process can be conducted virtually, the number of controlled variables can be much higher than in real experiments, and calibrating the model to a specific context can be done efficiently.

Simulation may be a suitable teaching aid in many situations, but should be used only when a valid model can be obtained. Otherwise, there is a risk that students observe effects that are not realistic, and thus incorrect learning might occur. Well-researched models with extensive validation are necessary. Other disciplines, such as mechanical engineering or molecular chemistry, already use simulation to analyse technologies and processes and thereby reduce the need for real experiments. In software engineering, this trend is still focused on product aspects such as understanding the dynamic behaviour of control software. However, simulation has already been applied successfully for understanding and analysing software processes as well as for educational purposes.

Process simulation can be combined with real software engineering experiments (for example, by using empirical data to calibrate a model or by comparing such data with simulation results) or used on its own. Simulation-based experiments can be classified by the number of treatments and the number of subject groups per treatment. In the case of single-project studies (i.e., one treatment and one group), simulation requires initialization of appropriate input parameters and calibration to the context. In the case of multi-project variation (i.e., more than one treatment and one group), the simulation model needs to be calibrated to different contexts. Replications (i.e., one treatment but more than one group) basically refer to several simulation runs, typically with statistically based variations. In the case of blocked subject-project studies (i.e., more than one treatment and more than one team), simulation model development requires a good understanding of cause-effect relations in varying contexts. Combining simulation with experiments can be done in the following ways:
• Empirical knowledge from real experiments can be used for creating the simulation model (e.g., to calibrate the simulator).
• Results from simulation runs can be used for designing real experiments (e.g., to identify and investigate new hypotheses before performing expensive real experiments).
• Both can be done in parallel (e.g., to broaden the scope of the experiment).

From the research perspective, software process simulation can be seen as an additional, efficient mechanism to gain knowledge about the effects of processes in different contexts. It especially allows for analysing situations that are difficult, expensive, or impossible to analyse in real experiments, and it allows for flexible variation of the context and the controlled variables.
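As an illustration of the idea (a deliberately toy model of our own; the parameters, the constant-productivity assumption, and the fixed rework fraction are not drawn from any validated model), the following sketch simulates a development process in which a fixed-size team completes tasks while a fraction of finished work returns as rework, letting students observe how a seemingly small rework rate stretches the schedule.

    # Minimal sketch of a software process simulation: a toy discrete-time model.
    # Illustrative assumptions only (not a validated model): constant
    # productivity and a fixed fraction of completed work returning as rework.

    def simulate(total_tasks=200, team_size=5, tasks_per_person_week=2.0,
                 rework_rate=0.25, max_weeks=500):
        """Weeks until less than one task (including spawned rework) remains."""
        remaining = float(total_tasks)
        week = 0
        while remaining >= 1 and week < max_weeks:
            done = min(remaining, team_size * tasks_per_person_week)
            remaining -= done
            remaining += done * rework_rate  # completed work spawns rework
            week += 1
        return week

    # Students vary one parameter and observe the non-linear schedule impact
    # (rework_rate must stay below 1 for the process to converge):
    for rate in (0.0, 0.1, 0.25, 0.4):
        print(f"rework rate {rate:.2f}: done after {simulate(rework_rate=rate)} weeks")

Replacing such a toy with a well-researched, calibrated model is exactly the validity step the preceding discussion insists on.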
From the educational perspective, simulation helps students gain a better understanding of the dynamics of software development processes, get immediate feedback, and experience the effects of decisions. Feedback can be obtained quickly using time-lapse effects. When creating the model, students gain insights into cause-effect and other relationships. Creating a simulation model promises to improve understanding of the key factors and complexity of a specific software process. From an industry perspective, learning cycles can be accelerated, risks mitigated, and the impact of processes, technologies, and their changes can be better understood.

Examples of software process simulation in teaching

Several educational software engineering simulation environments have been developed and are used for teaching purposes. Examples are the comprehensively evaluated SimSE environment (Navarro & Van der Hoek, 2007) and the SESAM (Software Engineering Simulation by Animated Models) environment (Ludewig et al., 1992). Münch, Rombach & Rus (2003) have developed a laboratory that allows real and virtual experiments to be combined systematically, and demonstrate the benefits of such a combination for teaching purposes.

Individual studies

All the aforementioned study types allow for teamwork and for training specific team-related skills. However, software engineering education also comprises several individual tasks, which are often performed by students while they work (for a limited time) in industry or while they write their theses (e.g., Bachelor's or Master's thesis, or semester projects). Although individual studies can be performed in industry-academia collaborations, they are usually conducted by individual students who work on a specific task and simultaneously perform the study. The student is thus a participant-observer in such studies. Individual studies depend on the setting in which they are carried out, e.g., requirements for a semester project differ from those of a Master's thesis. Different study types can be applied. Individual studies have high requirements regarding the study design, as they all have limited resources and strict time constraints in common. Specific challenges are scoping the study, narrowing down research questions, and defining the expected outcome. Since individual students conduct single studies, results are often limited to proofs of concept or demonstrators. Finally, data generated in this kind of study is often isolated, requiring a defined context to which it can contribute, e.g., a more comprehensive research strategy within which a particular study investigates one small aspect.

Although limited, the results of such a study can have inherent value for research and practice. For research, the study may contribute to a better understanding of research questions and might be a starting point for further, more comprehensive studies. From the industry perspective, individual studies allow a specific problem to be investigated in a well-defined environment, although, due to their limitations, analyses remain confined to that problem.
For instance, if the objective of the study is to develop an algorithm or to examine the feasibility of a specific method, a case study can be conducted that explicitly focuses on this aspect, resulting in a statement which then provides the rationale for further investigation. If the individual study is combined with an internship, students can summarise existing research on a topical area, which can then be used as training material for company employees. Finally, when students graduate, companies may wish to employ them (they already know the student).

From an educational perspective, students have to work in a self-organised manner, and they learn how to set up and conduct a problem-oriented study (including all its aspects, e.g., stakeholder interaction, study planning, data collection, etc.). Furthermore, students gain specific domain knowledge beyond the more general knowledge they acquire at university.

Examples of individual studies in teaching

Rein & Münch (2013) describe an individual student study that was performed as part of a seminar thesis. The study aimed at analysing features of a mobile app and consisted of design, instrumentation of the app with appropriate measurement instruments, and analysis of data from more than ten thousand users. The study results provided valuable, data-based justifications on how to further develop the app. In addition, a new method for analysing feature value was piloted and experience with the applicability of the method was gained.

A popular example of individual studies in industry is the so-called Personal Software Process (PSP), a training programme that consists of a series of systematically defined software engineering exercises. Rombach et al. (2008) have analysed data from 3,090 engineers conducting the PSP. A major finding from this analysis is that the effects of applying software engineering principles can be experienced at an individual level. Although the effects of applying such principles can typically only be seen at a larger scale (e.g., large projects, long-lasting development efforts, multi-team developments), this study shows that it is possible to teach these principles at the individual level as well.

Further instruments

In the previous sections, we provided an overview of different empirical instruments, a discussion, and examples of their application in teaching. However, there are further means that can contribute to the aforementioned study types.

Replication studies

Replication provides an opportunity to learn from an already established research design and can, if conducted well, contribute additional evidence for a research question. Replication repeats empirical studies to solidify their results, test result reproducibility, increase result validity (e.g., Easterbrook et al. (2008) and Park (2004) consider replication a kind of triangulation), and broaden research context and scope by repetition under similar conditions while changing selected variables, e.g., site, population, and instruments. Thus, students can learn from adapting the research design in a new environment and by comparing the results obtained to those of the original study. At the same time, teachers should prepare students for sometimes large differences in results. Lack of generalisability is often cited as a limitation of empirical studies, and replication is a step toward creating generalisable knowledge.
However, replication in software engineering is considered immature and is subject to debate. Juristo & Gómez (2012) argue that results from current software engineering experiments are often produced by chance, are artificial, and are too far away from reality. They mention that the key experimental conditions are as yet unknown, as the tiniest change in study design may lead to inexplicable results. Due to the large number of varying factors in the context of software engineering, it can be questioned whether close replication is possible at all (Juristo & Vegas, 2011). Nevertheless, conducting a replication can be a valuable learning experience which develops students' ability to design studies and to critically compare the results of studies addressing the same or similar questions.

Although industry-based research is considered the optimal way to gather reliable and relevant data, empirical research in industry is hard; replications are even harder. For instance, as we discussed in Kuhrmann (2015), small companies usually have limited resources to conduct empirical research, as it requires preparation, time, and allocation of resources. Armbrust et al. (2008) mention the importance of pilot projects in the context of case study research and discuss the difficulties of finding proper projects and allocating resources. Replication increases the level of difficulty, as experiments and case studies are conducted multiple times, thus blocking critical resources for long periods of time without immediately creating value in terms of products and services.

While replication in industry is hard to implement, replicated experiments and case studies are easier to realize in education, since universities provide a stable environment. The subjects are usually what deviates from the original design, while, e.g., the case, instruments, and procedures can be kept stable. Thus, once a consolidated experiment design is in place, replications can be implemented on a regular basis. However, the question of whether results obtained with student subjects can be generalised to industry is then of crucial importance.

Real-life examples and open source software

A major challenge for teachers is providing students with problems of considerable size. One approach is to rely on Open Source Software (OSS) projects, with many publicly available cases, problems, and code to investigate. They offer complex challenges going beyond typical local, university-driven projects. OSS projects are distributed and decentralised, utilising virtual teams in which participants can range from individual volunteers to professional development teams employed by a company. Furthermore, the practical relevance of OSS is unquestioned, since OSS projects set de facto standards for software development in certain application domains, e.g., operating systems (Linux), web servers (the LAMP stack), and mobile ecosystems (Android).

Participating in OSS can benefit industry by leveraging development capacity for projects exceeding their own capabilities. In many cases, companies must participate in OSS and build products on OSS platforms in order to access customers who are already using them. Learning how to function in this context, e.g., working in a self-organising virtual team, requires particular knowledge and skills. Therefore, OSS projects are fruitful grounds for setting up a sophisticated teaching environment.
Individual students can directly participate in a single project and investigate a specific problem, or, in order to achieve more demanding learning goals, groups of students can participate through a collaborative program (Richardson, Milewski & Mullick, 2006; Keenan, Steele & Jia, 2010). For instance, students from several universities can participate in a common project pool (e.g., Fagerholm et al., 2014b; Fagerholm, Oza & Münch, 2013). From the industry perspective, OSS projects offer increased visibility and opportunities for recruitment, contribution of added features, and sustainability of key OSS components. From the teaching perspective, OSS projects provide a realistic learning experience with large software systems and allow many aspects of collaborative software development to be experienced. For researchers, OSS projects have a large amount of data available, with easier access than from companies' internal projects.
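As a small illustration of this data availability, the following sketch (our own example; the repository path and the focus on commit counts per author are arbitrary choices) collects basic contribution data from a local clone of an OSS repository, the kind of raw material a classroom study might start from. It assumes a git client is installed.

    # Minimal sketch: extracting contribution data from a local OSS repository.
    # Requires a local clone and an installed git client; the path is hypothetical.
    import subprocess
    from collections import Counter

    def commits_per_author(repo_path):
        """Count commits per author in the given repository clone."""
        log = subprocess.run(
            ["git", "-C", repo_path, "log", "--pretty=format:%an"],
            capture_output=True, text=True, check=True,
        )
        return Counter(log.stdout.splitlines())

    # Example use in a course exercise (the path is an assumption):
    for author, count in commits_per_author("./some-oss-project").most_common(10):
        print(f"{count:5d}  {author}")

From such raw counts, students can move on to questions about team structure, turnover, or self-organisation in the project.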
Data from OSS projects has been used for several purposes, including the improvement of teaching. Other data sources are also available for analysis, providing evidence-based means to improve teaching. An emerging trend is using learning analytics to support good learning outcomes, to better understand learning progress, and to construct student profiles for tailored teaching. Ideally, this can allow real-time reaction to improve learning outcomes and can allow larger masses of students to participate in courses with limited teaching resources. The main benefits are, for instance, addressing the drop-out problem and providing more customised teaching for individuals. At the University of Helsinki, an example of a platform allowing learning analytics is the mooc.fi online learning environment, which provides courses on cyber security, programming in several languages, web service development, and algorithms. Research on the platform has, for instance, contributed methods to identify students based on typing patterns (Longi et al., 2015), which can help prevent cheating in an online environment, and to identify students in need of assistance, which allows increasing guidance for struggling students early on and providing more challenging assignments for high-performing students (Ahadi et al., 2015).

A GUIDELINE FOR INTEGRATING EMPIRICAL STUDIES WITH SOFTWARE ENGINEERING COURSES

In this section, we develop an experience-based guideline for integrating empirical studies with software engineering courses. We base the guideline on experiences gathered from our own software engineering courses, categorised from several perspectives. We first generalise common purposes, challenges, and validity considerations. These serve to determine the appropriateness of a particular study in a given context. We discuss appropriateness from two perspectives: (1) teaching at universities and (2) industry training. Finally, we share our experiences, discussing several aspects to be considered when integrating empirical studies with software engineering courses, e.g., motivation, scheduling, and effort.

Purposes, challenges, and validity

To summarise the aforementioned kinds of empirical studies, we created the taxonomy presented in Tables 2–4. In this initial taxonomy, we include different purposes, challenges, and validity constraints to support the categorisation of study types and the analysis of appropriateness in certain contexts.

We identified a total of ten purposes, describing the major learning goals associated with empirical studies in software engineering teaching that we consider important (Table 2). Complementing the purposes, we identified eight challenges that should be taken into account when designing empirical studies for educational purposes (Table 3). Purposes and challenges are intended to help teachers determine which study type is appropriate in a certain setting, e.g., does the actual setting allow for a (full) experiment, and if so, which challenges need to be addressed?

Apart from purposes and challenges, the quality of the outcomes of empirical studies, especially in the context of teaching using students as subjects (Runeson, 2003), must be considered carefully. Taking the close relation to industry and the relevance of the topics into account, we analysed the different study types for validity constraints. For example, researchers seek validity to solidify findings and to pave the way for generalisable knowledge, while industry is interested in business value. Furthermore, from a teaching perspective, result validity may be considered less important than achieving the learning goals. Therefore, we derived four validity considerations associated with empirical studies (Table 4).

Table 2: Summary of learning goals and purposes for empirical studies in education.

P01 Learn to formulate a research problem. Students face a (real-world) problem that needs investigation. Therefore, the learning goal is to:
• Capture the problem.
• Formulate research questions.
• Formulate hypotheses regarding users or customers and their behaviour.
Due to the complexity of realistic, real-world settings, this task is demanding, e.g., formulating a problem in a scientifically sound way while keeping the (industry) partners' needs in mind.

P02 Learn to collect relevant data. Collecting data in realistic settings is a demanding task, as data is usually scattered across different sources. The learning goal is to develop a meaningful data collection strategy that includes data from multiple sources within a setting, optionally backed up by further external data (from outside the given setting).

P03 Learn to analyse real-life data. Data from real-world situations is often incomplete or confidential, which hampers analysis. The learning goal is to develop a data analysis strategy to overcome limited data.

P04 Learn to draw conclusions. Based on collected and analysed data, the overall learning goal is to draw conclusions. Thus, in the (realistic) setting, students need to learn to:
• Gather empirical evidence on which conclusions are based.
• Test theories and/or conventional wisdom based on evidence.
• Draw conclusions from (limited) data and develop a strategy to utilise findings in practice.
The purpose is to gather findings or evidence, and to analyse the findings for relevance in the respective setting. Eventually, findings must contribute to solving the original problem; thus, another learning goal is to develop transfer strategies that support utilisation of the findings in practice.
This leads to:
• Experience regarding the problem–solution relation, e.g., understanding of the relationship between user behaviour and software design.
• Increased knowledge about a problem (domain).
• Increased knowledge about technology and methods.
• Increased knowledge about potential/feasible solutions and/or solution patterns.
Skills addressed by this learning goal are basic prerequisites that allow for developing solutions in general: these skills address a specific problem, but also allow for developing transferable knowledge that can be applied to different contexts.

P06 Develop a software artefact. In software engineering, software artefacts, especially prototypes, serve the (early) analysis of a specific problem. For this, prototypes allow for implementing and demonstrating solution strategies. The learning goal thus comprises:
• Create a (software) prototype to demonstrate a solution approach/strategy (feasibility study).
• Create artefacts to elaborate potential solution approaches/strategies for dis-/advantages (comparative study).
• Create artefacts to establish (quick) communication and feedback loops.
Software artefacts in general and prototypes in particular serve the elaboration of a problem and help to understand the potential solutions. That is, such artefacts pave the way to the final solution.

P07 Coaching. Another learning goal is to make stakeholders familiar with new methods and tools. Hence, utilisation of the new methods/tools needs to be trained, i.e., the necessary skills must be developed and practised.

P08 Change of culture. Continuous experimentation comprises a number of the other learning goals. However, continuous experimentation is more of a general organisational question than a project-specific endeavour. Therefore, utilising continuous experimentation also implies a cultural change toward experimentation in the implementing organisation.

P09 Learn about the impact. Specific behaviour or decisions impact a system and/or a team, e.g., changing requirements or fluctuation in team composition. Therefore, it is important to learn about the effects that certain behaviour and decisions have in large and/or dynamic contexts.

P10 Learn about long-term effects. Apparently "local" decisions might cause "global" effects. Thus, it is important to know about the long-term and/or snowballing effects caused by single decisions, e.g., a shortcut in the architecture leads to increased maintenance cost (technical debt).

Table 3 Summary of challenges that empirical studies in education face.

C01 Finding or creating relevant cases. The major challenge is to find and define proper and relevant cases, which bears some risks:
• A case may become irrelevant while conducting a study (e.g., changing environment, changing context parameters).
• A study might go in an unexpected direction (learning curve and, in response, focus shift).
• A relevant case must be narrowed down to the participating subjects, e.g., students have different skills and goals than professionals.
Cases must be balanced, e.g., learning goals must be achieved regardless of whether the original case loses its relevance (procedural over technical knowledge), and students need to finish a thesis regardless of whether industry partners can apply the study findings.

C02 No guaranteed outcome.
Even if a suitable problem is found, there is no guarantee that a study will lead to an outcome. Furthermore, immediate applicability of the outcomes is not guaranteed, which means extra work for industry to transfer results into product development.

C03 Time constraints. Apart from the appropriateness of the actual problem, time constraints limit the study. Time constraints can occur as:
• Limitations dictated by the curriculum/course schedule.
• Limitations dictated by industry schedules, e.g., product development cycles.
• Limitations dictated by individual schedules, e.g., students who are about to finish their studies.
Therefore, time constraints, together with resource limitations, define the basic parameters that affect the study objects (problem, potential/achievable solutions, completeness of results, validity of outcomes, and so forth).

C04 Resource limitations. Studies require resources; thus, the availability of resources limits the study. Resource limitations can occur as:
• Availability of (the right) students, e.g., if a study requires students with a specific skill profile.
• Motivation of students to participate in a study (personal vs. study goals).
• Availability of industry resources (personnel tied to a study).
• Options to adequately integrate the study with (running) company processes.
Availability is an especially critical factor. For instance, while one experiment consumes resources once, repetition and replication require a long-term commitment regarding resource availability, which implies significant investments of time and/or money. In order to make resources available, participating partners need to receive a sufficient benefit, which is often hard to define in empirical studies.

C05 Limited access to data. Although it is one purpose in terms of learning goals, defining adequate hypotheses and variables that can be investigated in a course is challenging. Proper measurements must be defined, taking into account that potentially not all data is available, e.g., confidential data. Access to user data is especially challenging, as this data is usually strictly confidential (a way out could be utilising OSS projects).

C06 Built-in bias. A special problem is bias. Each particular setting comes with an inherent set of biases, e.g.:
• Students' special skills affect the study, and students who are trained in advance of the study affect the outcomes.
• Too much or too little context knowledge of the subjects affects the study.
• Competing goals of the participants (especially students vs. practitioners) affect the study, e.g., students might try to optimise a study to achieve better grades while compromising the study goals.
Empirical studies suffer from certain limitations, and in the context of teaching, special attention needs to be paid to bias and threats to validity.

C07 Communication. Empirical investigations create knowledge, data, and potentially software artefacts. Therefore, results need to be quickly communicated to the participants. Quick feedback helps to, e.g., determine the relevance of the results, the appropriateness of the instrument, and any necessary adjustments. Thus, fast feedback loops are necessary.

C08 Creating a simulation model. For simulation-based research/teaching, setting up a simulation model is a demanding task, which consumes time and resources and thus generates cost. The entire domain under consideration must be captured to create a model that allows for generating useful data (a toy illustration follows this table).
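To make C08 concrete, the following toy model sketches what even a minimal simulation model involves: explicit assumptions about team size, turnover, and productivity ramp-up, encoded so that long-term effects (cf. P09, P10) become visible. All structure and parameter values here are invented for illustration; a model intended for actual teaching or research would need the domain captured far more carefully.

```python
import random

def simulate(iterations=50, team_size=5, turnover_rate=0.1, ramp_up=8, seed=1):
    """Toy process simulation: tasks completed over many iterations by a team
    whose members are randomly replaced; new members ramp up linearly."""
    rng = random.Random(seed)
    tenure = [ramp_up] * team_size          # iterations each member has spent on the team
    done = 0.0
    for _ in range(iterations):
        for i in range(team_size):
            if rng.random() < turnover_rate:
                tenure[i] = 0               # member replaced by a novice
        # assumed productivity grows linearly from 0.2 to 1.0 during ramp-up
        done += sum(min(1.0, 0.2 + 0.8 * t / ramp_up) for t in tenure)
        tenure = [t + 1 for t in tenure]
    return done

stable = simulate(turnover_rate=0.0)
churning = simulate(turnover_rate=0.1)
print(f"no turnover: {stable:.0f} tasks, 10% turnover: {churning:.0f} tasks")
```

Even this tiny model lets students vary one assumption at a time and observe snowballing effects that a single course project could never make visible.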
Table 4 Summary of validity considerations when using empirical studies in education.

V01 Emphasis: meeting teaching goals. Use is valid if procedural learning goals are met. Validity of conclusions and usefulness for industry are of secondary importance from the teaching perspective.

V02 Emphasis: meeting business or organisational goals. Validity depends on what value is created for the business. Direct business value is rare; a more likely result is increased knowledge of the problem area, technology, work methods, or potential solutions or solution patterns.

V03 Emphasis: creating a sound study design. The focus is on internal validity, and the results of a study are "side effects". If the learning goal is to understand experimentation itself, internal and external validity could have higher relevance.

V04 Emphasis: meeting research goals. Especially in simulation, the validity of the gathered data depends on the quality of the simulation model, and also on the quality of the simulation environment.

Establishing context and goals, and determining a study type

We provide an initial assignment of purposes, challenges, and validity constraints to the four major study types: experiment, case study, simulation, and continuous experimentation. Individual studies are left out, since their particular challenges result from the concrete instrument applied in the respective study, e.g., an individual study may implement an experiment or a case study. Table 5 provides the assignment from the academic perspective, while Table 6 provides the industry perspective.

Table 5 Education in academia/university.
Experiment: Purpose P01, P03, P04; Challenges C01, C03, C04, C05; Validity V01.
Case study: Purpose P01, P02, P03, P04, P05, P06; Challenges C01, C02, C03, C04, C05; Validity V01, V02.
Continuous experimentation: Purpose P01, P02, P03, P04; Challenges C01, C03, C04, C05; Validity V03.
Simulation: Purpose P09, P10; Challenges C01, C05, C08; Validity V04.

Table 6 Education in industry.
Experiment: Purpose P04, P05, P06; Challenges C01, C03, C04, C05, C06, C07; Validity V02.
Case study: Purpose P05, P06; Challenges C01, C02, C03, C04; Validity V02.
Continuous experimentation: Purpose P02, P03, P06, P07, P08; Challenges C03, C04, C05, C07; Validity V02.
Simulation: Purpose P09, P10; Challenges C01, C05, C08; Validity V04.

The tables support decision-making when selecting appropriate instruments. For instance, if the context is a university, and students shall learn to solve a particular problem (P05) by developing a software tool (P06), teachers should opt for a case study. In an industry context, both case studies and experiments can be utilized. However, as Table 6 illustrates, an industry experiment is more demanding, with more challenges to address, e.g., built-in bias (C06) and communication (C07), and a different validity emphasis (V02). Another observation is that no differences are suggested between the two settings in the case of the simulation instrument. In both, the major challenge is the simulation model, which affects learning, the effort for its creation (C08), and validity constraints (V04).

Concrete goals must be considered and balanced alongside contextual information when selecting a particular study type. This also includes goals going beyond classic learning goals. For instance, all stakeholders (students, teachers, and industry partners) come into contact when performing case studies in industry, which opens up several opportunities. Students make contact with industry and could find a job, or team up with other students around an idea that may eventually lead to the founding of a company. Industry, on the other hand, can conduct research cheaply, as companies usually pay with time spent, sponsor an idea, or pay a small fee to keep a software engineering lab running; in either case, industry gets access to the latest knowledge and fresh resources. Finally, researchers have the opportunity to conduct some research (given the limitations mentioned above).
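Read as data, Tables 5 and 6 support a mechanical first pass at study type selection: a study type is a candidate if its purpose set covers every purpose the teacher wants to address; challenges and validity emphasis then require human judgement. The sketch below is our own illustrative transcription of the tables, not part of the guideline itself.

```python
# Purpose columns of Tables 5 and 6, transcribed as lookup data.
ACADEMIA = {
    "experiment": {"P01", "P03", "P04"},
    "case study": {"P01", "P02", "P03", "P04", "P05", "P06"},
    "continuous experimentation": {"P01", "P02", "P03", "P04"},
    "simulation": {"P09", "P10"},
}
INDUSTRY = {
    "experiment": {"P04", "P05", "P06"},
    "case study": {"P05", "P06"},
    "continuous experimentation": {"P02", "P03", "P06", "P07", "P08"},
    "simulation": {"P09", "P10"},
}

def candidate_study_types(setting: dict, wanted: set) -> list:
    """Study types whose purpose set covers all requested purposes."""
    return [t for t, purposes in setting.items() if wanted <= purposes]

# The running example from the text: learn to solve a problem (P05)
# by developing a software tool (P06).
print(candidate_study_types(ACADEMIA, {"P05", "P06"}))  # -> ['case study']
print(candidate_study_types(INDUSTRY, {"P05", "P06"}))  # -> ['experiment', 'case study']
```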
Motivating students to conduct empirical studies

Making contact with industry, and the prospect of finding a job or starting a company, fosters students' motivation to actively participate in courses, and may contribute to higher engagement and a better understanding of course contents. For instance, in Kuhrmann, Fernández & Münch (2013), we reported on a new teaching format applied to a software process modelling course, including an empirical study. The course evaluation showed that students rated the course significantly better than the previous instance (without the study), although they perceived the course as more demanding (see 'Example 1: a course on software process modelling with and without experiments'). The evaluation showed that students understood the contents and their relevance better, gathered advanced knowledge, and learned to apply it while experiencing practical effects, e.g., the consequences of wrong design decisions. Our experience also shows that encouraging students to develop ideas and create products boosts motivation (cf. Brügge, Krusche & Alperowitz, 2015). For instance, smartphone apps can be developed in collaboration with industry partners and published in an app store. Gaining visibility, real clients, real feedback, and real bug reports guides students through the whole software development and product life cycle.

Nevertheless, apart from all potentially positive motivating drivers, a major driver for students is to get the best possible grade. Also, the number of credits must reflect the effort required to conduct the study. For students, the amount of time required to receive a credit point is an important consideration. Since empirical studies are demanding in terms of effort, and credits form the compensation, software engineering courses that include empirical studies must adequately "remunerate" the students for their efforts.

Scheduling

Having defined the goals and acquired (motivated) students and, optionally, partners from industry, the challenges C03 and C04 (Table 3) must be addressed. Planning empirical studies in a standard university curriculum is demanding, as students usually take several courses in parallel and thus have limited time. Furthermore, courses often span 12–15 weeks, and if industry is involved, their schedules must be respected as well. In Kuhrmann (2012), we provided a generic template that integrates classic teaching with explicit workshop slots, which can be used to conduct empirical studies. In Kuhrmann, Fernández & Münch (2013) and Kuhrmann, Femmer & Eckhardt (2014), we provided concrete instances and reported on the feasibility of the proposed template.
However, conducting empirical studies in collaboration with industry requires refining the generic template. We consider three basic planning patterns appropriate:

• Workshop model: Teachers, students, and practitioners conduct a workshop in which they collaboratively work on a problem. An example is a lab-based environment, such as the Software Factory (Fagerholm, Oza & Münch, 2013). Moreover, the workshop model is quite common in industry training (usually 1–5 days); a study that fits this schedule is more likely to be accepted by industry partners.

• Interleaved model: The interleaved model allows conducting a "long-running" study. Normal work slots alternate with workshop slots, e.g., a new method is deployed and trained, practitioners apply it, researchers evaluate, improve, and/or train new aspects of it, practitioners continue application, and so forth. This model also proved beneficial when supervising students conducting individual studies in industry. There are several benefits: regular work is not disturbed over a long period, training can be done iteratively, and cases can be observed over a longer time period. However, course schedules limit the applicability with student groups.

• Observation model: This is the classic research model adapted for educational purposes. Students or practitioners are instructed, work independently on a task, and receive coaching from teachers. Besides the coaching, teachers monitor the correct application of empirical methods to collect and analyse data.

Planning the study and aligning the study plan with all time constraints needs to be done carefully, and requires the commitment of all participants to ensure the availability of personnel and resources.

Further study type selection criteria

Apart from the criteria already discussed, we wish to highlight some further criteria that may influence the selection of study types for educational purposes. First, in Table 7, we summarize well-known criteria from the literature (e.g., Wohlin et al., 2012) and further criteria that we consider relevant for study type selection. The table includes an experience-based rating for the criteria. However, this rating has to be considered a subjective recommendation, as it is hard to precisely define, e.g., the degree of motivation or student satisfaction. We note that the knowledge and skill level of students should also be taken into account when selecting and tailoring an empirical instrument for teaching. In Table 8, we provide an experience-based assessment of how different study types can be adjusted to different levels of students. Two student levels are considered: Bachelor's (0–3 years of study) and Master's (3–5 years of study). In industry, these may be interpreted based either on employees' level of education or on their working experience in the field.

Table 7 Further study selection criteria for different study types. Each study type is ranked relative to the others on three levels and may span more than one level (LO: low, ME: medium, HI: high).
[Table 7 body: each of the five study types (experiment, case study, continuous experimentation, simulation, and individual studies) is rated on the LO/ME/HI scale for nine criteria: degree of execution control, degree of measurement control, degree of validity, motivation to participate in a study, motivation created by the study, student satisfaction, scheduling effort, ease of goal definition, and effort to prepare/conduct a study.]

Table 8 Adjusting study types to student levels. Length of study is indicative and based here on European standards.

Experiment. Bachelor's level (first 3 years of study): Simple experiments with few variables. Experiment design given. Master's level (between 3–5 years of study): More complex multivariate experiments. Own experiment design.
Case study. Bachelor's level: Limited topics, restricted to a chosen context, few informants. Little or no generalisation. Exploratory, descriptive, or intrinsic case studies. Master's level: Topics related to well-specified software engineering areas. Some generalisation. Limitations of generalisation fully analysed. All case study types.
Continuous experimentation. Bachelor's level: Rudimentary practice with synthetic scenarios. Focus on understanding basic steps such as identifying assumptions, creating hypotheses, and collecting data. Master's level: More advanced scenarios or limited real-life experiments. Focus on drawing conclusions from data and understanding limitations.
Simulation. Bachelor's level: Using ready-made simulation models and given data to explore topics through simulation. Master's level: Exploring the effect of changes in models using given data, or how ready-made models behave with student-collected data. Some exploration with creating simulation models.
Individual studies. Bachelor's level: Focus on finding and summarising existing research. Master's level: Focus on answering specific research problems by applying existing research and own data collection. No requirement of scientifically novel results.

The primary means of adjustment is the selection of a suitable problem scope and the setting of expectations for an appropriate result scope. For example, case studies at the Bachelor's level can be more limited in scope and focus on exploratory, descriptive, or intrinsic designs without much generalisation beyond the case environments. At the Master's level, some generalisation can be expected, although still limited; an assessment of the possibilities to generalise can be expected at this level. This assessment must be considered a subjective starting point for adjustment, as students are different, and educators should, as far as possible, tailor courses for individuals in order to provide the best opportunities for learning.
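As one example of the "rudimentary practice with synthetic scenarios" that Table 8 suggests for continuous experimentation at the Bachelor's level, a single experimentation cycle can be rehearsed entirely on invented data: state an assumption as a testable hypothesis, "collect" conversion counts for two product variants, and test the difference. The scenario, the numbers, and the choice of a two-proportion z-test below are illustrative assumptions only.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic and two-sided p-value for comparing two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothesis (invented scenario): the redesigned sign-up page (variant B)
# changes the conversion rate compared to the current page (variant A).
z, p = two_proportion_z(conv_a=48, n_a=1000, conv_b=74, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")  # reject H0 at the 5% level if p < 0.05
```

At the Master's level, the same skeleton can be extended to real-life data, where students must additionally reason about sample size, multiple testing, and the limitations of the collected data.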
EXPERIENCES

In this section, we provide some experiences gathered from implementing empirical instruments in university teaching. We provide selected examples, outline the respective courses (purpose, approach, outcomes), and provide feedback and evaluation (formal as well as informal) to reflect the students' perception of these courses.

Example 1: a course on software process modelling with and without experiments

A course on software process modelling, which implements the curriculum presented in Kuhrmann, Fernández & Münch (2013), serves as the first example. The course was offered multiple times at the Technische Universität München (TUM) and the University of Helsinki. In Munich, after the initial run, the course was reorganized according to the concept presented in Kuhrmann (2012), in which we presented an approach to integrate experimentation with practical software engineering courses. Due to the reorganization, students experienced the (abstract) topics while conducting a controlled experiment, on which we reported in Kuhrmann, Fernandez & Knapp (2013). Moreover, due to the repeated execution, in which we applied a course structure both without and with empirical instruments, we can present a number of experiences and a comparison.

Formal evaluation

In Table 9, we present the comparison based on the formal course evaluations conducted by the Faculty of Informatics at TUM. Due to updated questionnaires, the evaluations are not directly comparable; however, the basic information can still be extracted. Note that in Table 9, smaller scores are better.

Table 9 Formal evaluation (anonymous questionnaire, comparison winter 2010/2011 and 2011/2012, TUM; result interpretation: ↑, large improvement; ↗, small improvement; →, no change; ↘, small deterioration; ↓, large deterioration).

Criterion | Winter 2010/2011 | Winter 2011/2012 | Result
Number of completed questionnaires | 6 (9 participants) | 8 (14 participants) | –
Common criteria (1 = very high, 5 = very low):
Complexity | 3.00 (old questionnaire: "level") | 2.38 | −0.62 ↑
Volume | – | 2.12 | −0.71 ↑
Speed | 2.83 (old: one question) | 2.75 | −0.08 →
Appropriateness of effort compared to ECTS points | n.a. | 3.00 | n.a.
Overall rating (1 = very good, 5 = very bad):
Lecture | 1.25 | 1.5 | +0.25 ↘
Exercise | 2.17 | 1.33 | −0.84 ↑
Relation to practice | 2.0 | 1.62 | −0.38 ↗

The formal evaluation shows a significant improvement of the scores regarding exercise quality and relation to practice, although, at the same time, the students also perceived the lecture as more demanding. Since the basic course contents did not change, we interpret this evaluation as an increased awareness toward the course topic, which might be caused by the stronger utilization of practical aspects through the experiment. We see this as an indication that introducing experiments could have a positive effect, although a full validation is missing.

Informal evaluation

Besides the formal faculty-driven evaluation, we also performed two informal feedback rounds in the course instance in which we adopted the empirical instruments. We asked the students to write a one-minute paper that answered the following three questions in a few words each:
1. (up to 5) points that are positive
2. (up to 5) points that are negative
3. (up to 5) points that I still wanted to say (informal)

Table 10 shows the summarized results of the informal evaluation: the structure of the class, the selected topics, the combination of theory and practice, and the way of continuously evaluating the work and determining the final grades were rated positively. Especially the practical projects and the teamwork in the workshops were highlighted. On the other hand, students mentioned the tough schedule and the not always optimal tailoring of tasks for the practical sessions (since we had informed the students about the "experimental" character of this special course in advance, they did not complain, but welcomed the opportunity to give feedback to improve their own class).
Table 10 Summarized evaluation of the one-minute papers (winter 2011/2012, TUM).

Positive aspects: Structure of the topics and the class; combination of theory and practice; projects in teams (atmosphere); self-motivation due to presentations; continuous evaluation and determination of the final grades.
Negative aspects: Tough schedule; tailoring of the tasks for the practical sessions was not always optimal; students signed off just because of the examination procedure.
Informal: "Thank you, this was the lecture I learned the most." "Super class, and I loved those many samples from practice."

Example 2: a course on agile project management and software development with experiments

The second example is an advanced course on agile software project management, which is also grounded in the general course pattern presented in Kuhrmann (2012). A detailed description of the course and the data obtained from the experiments is provided in Kuhrmann, Femmer & Eckhardt (2014). In this course, offered at the Technische Universität München, the main purpose of the experiment instrument was to create awareness—scientific results were not the objective. We implemented two experiments:

Experiment 1 (Group Dynamics)

The first experiment aimed at demonstrating how groups of people collaborate in teams under stress (Kuhrmann & Münch, 2016b). We introduced the Tuckman Model (Tuckman, 1965), which describes group formation processes, and designed a simple experiment in which the students had to sort sweets and document the outcomes. During the different experiment runs, we put pressure on the students, e.g., through increased task complexity, enforced turnover, and external disturbances. Although this experiment did not aim at finding new scientific revelations, we could confirm the Tuckman Model and show that group performance suffers from turnover.

Experiment 2 (Distributed Development)

The second experiment was designed to give the students the opportunity to deal with hopeless situations (Kuhrmann & Münch, 2016a): we designed a software engineering "Kobayashi Maru" test (in the Star Trek franchise, the Kobayashi Maru is a leadership test with a no-win scenario; see https://en.wikipedia.org/wiki/Kobayashi_Maru). Students were separated into two sets, each consisting of two groups, for a total of four groups. Each group had to develop a very simple console-based chat application (requirements in the form of user stories and test cases were provided). The groups were separated (each was located in a separate room), and each group was allowed to use only one communication channel (e-mail and Skype, respectively). After the task had been presented to them, the groups were immediately separated to avoid any direct communication, and for each group, a researcher monitored compliance with the experiment rules. As the students did not have the chance to initially reach agreements, the projects were failures by design. The students immediately started to work (they had only 90 min to develop the working software), yet nobody came up with the idea to negotiate a communication protocol first. Therefore, after the deadline, no group could show any working software. In a closing feedback session, we revealed the nature of the experiment and discussed the observations.
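The original assignment materials are not included in the paper; purely to illustrate the scale of the 90-minute task, the sketch below shows a minimal console chat peer of the kind the groups had to build, assuming a newline-framed, nickname-prefixed TCP wire format and hypothetical host and port values. The experiment's point is precisely that such a wire format would have had to be agreed between the separated groups first; without that agreement, two independently built clients cannot interoperate.

```python
import socket
import threading

# Hypothetical wire protocol the two groups would have had to agree on:
# one UTF-8 line per message, "<nickname>: <text>\n".

def serve(host="0.0.0.0", port=4000):
    """Wait for the partner group's client and return the connection."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        conn, _ = srv.accept()
        return conn  # the accepted connection outlives the listening socket

def connect(host, port=4000):
    """Connect to the partner group's server."""
    return socket.create_connection((host, port))

def chat(conn, nickname):
    """Run the receive thread and the console send loop."""
    def receive():
        buf = b""
        while True:
            data = conn.recv(1024)
            if not data:
                break
            buf += data
            while b"\n" in buf:          # protocol: newline-framed messages
                line, buf = buf.split(b"\n", 1)
                print(line.decode("utf-8"))
    threading.Thread(target=receive, daemon=True).start()
    while True:
        conn.sendall(f"{nickname}: {input()}\n".encode("utf-8"))

# One group would run chat(serve(), "alice"); the other would run
# chat(connect("10.0.0.2"), "bob") -- host address is an assumption.
```

The code itself is trivial; what the experiment demonstrates is that the hard part is the coordination artefact around it, which the imposed communication constraints prevented the groups from producing.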
Formal evaluation

In Table 11, we present the formal course evaluation conducted by the Faculty of Informatics. Although we have only one evaluated instance of this course, we use the same structure as in Table 9 to present the data. The evaluation shows this course to be on approximately the same level as the improved software process modelling course.

Table 11 Formal evaluation (anonymous questionnaire, winter 2012/2013, TUM).

Criterion | Winter 2012/2013
Number of completed questionnaires | 15 (19 participants)
Common criteria (1 = very high, 5 = very low):
Complexity | 3.27 (just right)
Volume | 2.93 (just right)
Speed | 3.07 (just right)
Appropriateness of effort compared to ECTS points | 3.07 (just right)
Overall rating (1 = very good, 5 = very bad):
Lecture | 1.33
Exercise | 1.40
Relation to practice | 1.57

Informal evaluation

Besides the formal faculty-driven evaluation, we again performed two informal feedback rounds in the course. We asked the students to write a one-minute paper (see above). Since the outcomes are essentially the same as those already presented in Table 10, we only present the informal comments (third question) in Table 12.

Table 12 Summarized evaluation of the one-minute papers (winter 2012/2013, TUM).

Informal: "Course discusses topics of interest from practical point of view." "I like the practical approach of teaching." "By far one of my favorite courses at all. Very interactive and relaxed atmosphere. Great exercises." "Interactive, student presentations, experiments." "Applicability of the course immediately in my work for other software projects."

Example 3: Master's theses using a case study approach

Master's theses provide opportunities for individual students to apply a specific research approach to a chosen problem. The third example comes from a selection of Master's thesis projects that we have supervised at the University of Helsinki, all of which used a case study approach. They are therefore examples of both individual studies and case studies.

The first thesis project investigated a software prototype game and applied usability and user experience evaluation methods to determine whether it fulfilled two sets of criteria: the entertainment value of the game, and the ability to tag photos as a side effect of playing the game. The game itself was implemented by a student team in cooperation with a company, and the thesis writer was part of the implementation team. In this thesis, the game constituted the case, and four sources of evidence were used: user interviews, in-game data collection, a questionnaire, and observations from a think-aloud playing session. The thesis can be characterised as an intrinsic case study (Stake, 1995; Baxter & Jack, 2008), since the objective was not to gain understanding of an abstract construct or general phenomenon, nor to build a theory. Rather, the case itself was of interest, and the results were suitable for making further decisions regarding the development of a full game based on the prototype.
The second thesis project investigated continuous delivery and continuous experimentation in the B2B domain. The objective was to analyse challenges, benefits, and organisational aspects in a concrete company case. The thesis writer was an employee of the company and was thus a participant observer. In this thesis, the case consisted of the development process used by two teams for two separate software products. Two sources of evidence were used: participant observation and interviews with 12 team members, six in each of the two teams. The thesis can be characterised as an exploratory deductive case study, where the aim was to explore how continuous delivery and continuous experimentation could be applied in the company and what challenges and success factors are encountered. The thesis aimed to generalise and to provide results that could be adapted to other B2B companies.

The third thesis project investigated the state of the practice of experiment-driven software product and service development. The objective was to understand the state of the practice of continuous experimentation and to identify key challenges and success factors in adopting it. The thesis can be characterised as a qualitative survey design, which resembles a case study but relies on a single source of evidence. In that sense, the thesis was close to an intrinsic case study, as it aimed to develop a multifaceted understanding of the topic rather than to develop theory. The thesis used material from 13 interviews in 10 software companies. The result of the thesis was a rich picture of the state of the practice concerning experiment-driven software development in the case companies. Although the primary aim was not to generalise, the results were relevant as comparison points for other companies.

Informal evaluation

Utilising a case study approach in the Master's theses provided the opportunity to investigate highly relevant problems in their natural context. Each thesis gained from having an industrial connection, which provided real-life constraints, questions, and data. In each thesis, the student had to consider the setting, objectives, questions, methods, data collection, and analysis procedures, and adjust the general case study research method to their particular implementation. We observed high motivation among the students, timely completion of subtasks and of the thesis as a whole, and clear maturation while managing a complex individual project. Two of the theses were developed into scientific papers that have been published in peer-reviewed forums.

Based on these examples, the difficulties related to case studies can be summarised into three categories. First, finding and scoping a relevant research problem can be difficult for many students, as they lack the necessary overview of the existing literature. The role of the advisor is of prime importance in the beginning: helping to formulate the research questions and to pinpoint what the case or unit of analysis is. Second, understanding case study research as a method can take a long time without proper guidance. Providing relevant method literature, identifying the key concepts, and providing an understanding of how to implement the method in practice—designing the study—are areas where the advisor can help.
The data collection is usually interesting and straightforward, perhaps with some practical challenges related to finding data sources, which can often be overcome with some persistence. The third category relates to performing the analysis and writing up the case report or thesis. Students do not often have a chance to practise these skills on a regular basis, and thus there are many questions regarding analysis choices and patterns for writing up results that an advisor may be able to help with.

Although we rely here only on informal evaluation, these examples have convinced us that case studies of different types are well suited as teaching tools. They require a wide range of skills which the students must acquire, and these skills are applicable in many other settings as well. Perhaps the most important insight to be gained from conducting case studies is that students are faced with a wide variety of data that challenges their preconceptions and develops their ability to observe phenomena in their real-life context.

DISCUSSION

Implementing courses using empirical instruments provided us with a number of insights. From the scientific and organizational perspective, we learned that course preparation requires more effort compared to classic teaching: the examples and cases to be used in experiments need to be tailored to the time restrictions imposed by the schedule. This has two major impacts. First, the investigated topic is of reduced complexity, which makes it less realistic. Second, research questions must be carefully selected so that they can be reasonably treated within the time constraints. Therefore, we consider explorative (curiosity-driven) or confirmative experiments meaningful, i.e., experiments of low criticality.

From the teaching perspective, we experienced that the choice of a real-world example rather than an artificial toy example proved successful. For example, the experiment outcome from Kuhrmann, Fernandez & Knapp (2013) was a fully implemented process, about which the process owner stated that he did not expect the student groups to create "such a comprehensive solution in this little time." Another goal—"let students experience the consequences of their decisions"—was also achieved. For instance, in the course on software process modelling, while implementing the process in a workshop session, we could observe a certain learning curve. One team had a complete design but selected an inappropriate modelling concept. Later, the team had to refactor the implementation, an annoying and time-consuming task, which increased their awareness of the consequences of certain design decisions. Furthermore, students experienced how difficult it is to transform informal information or tacit knowledge into process models, and could see how difficult it is for individuals to formulate their behaviour in a rule-oriented manner.

For the course on software process modelling in Munich, we compared the final grades of both course instances and observed significantly better grades in the second run. During the course exams, the students could not only answer all (theoretical) knowledge-related questions, but also all knowledge-transfer and application-related questions. The students usually referred to the practical examples and were able to transfer and apply their experiences to new situations.
Finally, the case study-based Master's theses allowed our students to be embedded in projects with real-life connections. Apart from their educational value for the students, they contributed to the scientific literature and helped the students in their early careers. Although our industry connections were important in obtaining the cases, the students themselves learned to be self-directed in their work and gained significant domain knowledge. As thesis supervisors, we found that there was some additional effort in introducing case study methodology to students—methodology courses do not fully prepare students to actually carry out a study of their own, which is to be expected. However, being embedded in the project and receiving feedback from the project environment and its stakeholders made it easy to convince students of the necessity of a structured approach. Once students were up to speed, the extra supervision effort was compensated by more autonomous work on the students' part.

Limitations

The guideline presented in this paper has not been systematically tested in different learning environments. Instead, it represents a starting point based on reflection grounded in teaching practice. We consider the limitations of the study in terms of qualitative criteria for validity (cf. Creswell, 2009).

Internal validity concerns the congruence between findings and reality. In this study, internal validity thus concerns how credible the guidelines are in light of the realities of software engineering education. As that reality is constantly changing, the match between guidelines and teaching can never be perfect. Our study has applied triangulation to increase the internal validity of the results: we have utilised several types of teaching, in different modes, in different universities, and with different teachers, to obtain a richer set of experiences from which to draw guidelines.

External validity refers to the extent to which findings can be applied to other situations. As our aim is not theory testing, external validity in this article is about enhancing, as far as possible, the transferability of the results. We argue that the guideline developed herein covers a wide range of teaching and learning situations, and thus can be applied widely in graduate and undergraduate education in software engineering. We have attempted to elucidate the limitations of applying the guideline by mapping study types differently to education in academia and industry, and to different purposes, challenges, and validity concerns of interest to teachers.

In addition to these limitations, we see certain situations where the guideline would be unsuitable. First, when the execution of an empirical study would cause ethical problems or legal consequences for any of the involved parties; in this case, the teacher should direct the student to a different task. Second, the guideline relies on the teacher to assess whether a particular student possesses the necessary prerequisite skills to carry out a particular study; the guideline is not transferable if that information is missing. Third, the guideline makes certain assumptions about the learning environment, such as the availability of industry partners for Master's degree projects and the availability of certain teaching resources for other study types.
When attempting to apply the guideline, teachers should consider whether the necessary resources are available.

CONCLUSION

There is a lack of guidance on how to use empirical studies in software engineering education. To address this gap, this paper provides an overview of different types of empirical studies, their suitability for use in education, and the challenges related to their execution. We analysed our own teaching and the different studies that we applied as part of it, and reported on selected studies from the existing literature. Rather than having students conduct pure research, we opt for including different empirical instruments in software engineering courses as a means to stimulate learning.

The present paper provides an initial systematisation of empirical instruments from the educational perspective. We derived a set of purposes and challenges relevant for selecting a particular study type, and discussed validity constraints regarding the results of course-integrated studies. Based on our experiences, we assigned the different purposes, challenges, and validity constraints to the different study types, and provided further discussion on motivation and scheduling issues. We also defined a set of further study selection criteria to provide an initial guideline that helps teachers select and include empirical studies in their courses.

We believe the guideline could be used in a wide variety of settings. We note that it is limited in that it considers only a limited number of study types and learning outcomes—those that the authors have experience with as teaching aids and study purposes. It may not be suitable in situations where significantly different study types or learning outcomes are called for. Since, to the best of our knowledge, no comparable guidelines exist, we cordially invite teachers and researchers to discuss and improve on this proposal. In particular, future work could focus on applying the guidelines in different kinds of software engineering courses and programs, both within academic university education and in industry training. The purposes, challenges, and constraints presented here could thus be further validated, refined, and perhaps extended.

Another particular consideration is how to perform student assessment when using empirical studies for educational purposes, particularly when group work is involved. What should be assessed, how should assessment be performed fairly when many students are involved, and how should, e.g., knowledge of empirical methods, domain knowledge, procedural knowledge, and the quality of outcomes be balanced in the assessment? We believe that the purposes and validity considerations in Tables 2 and 4 could serve as a starting point for creating rubrics that are relevant for this type of teaching.

Finally, further studies are needed to test the effectiveness of courses using the proposed approaches in terms of their ability to teach. The learning outcomes of such courses should be further explored: beyond what is currently known, what do students learn by conducting empirical studies, and how do their learning outcomes differ from other approaches to software engineering education?
ADDITIONAL INFORMATION AND DECLARATIONS

Funding
This work was supported by Tekes, the Finnish Funding Agency for Technology and Innovation, as part of the N4S Program of DIGILE (Finnish Strategic Centre for Science, Technology and Innovation in the field of ICT and digital business). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:
Tekes, the Finnish Funding Agency for Technology and Innovation.

Competing Interests
The authors declare there are no competing interests.

Author Contributions
• Fabian Fagerholm, Marco Kuhrmann and Jürgen Münch conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, performed the computation work, and reviewed drafts of the paper.

Data Availability
The following information was supplied regarding data availability:
The raw data is included in the tables.

REFERENCES

Ahadi A, Lister R, Haapala H, Vihavainen A. 2015. Exploring machine learning methods to automatically identify students in need of assistance. In: Proceedings of the eleventh annual international conference on international computing education research, ICER '15. New York: ACM, 121–130.
Armbrust O, Ebell J, Hammerschall U, Münch J, Thoma D. 2008. Experiences and results from tailoring and deploying a large process standard in a company. Software Process: Improvement and Practice 13(4):301–309.
Ball S, Emerson T, Lewis J, Swarthout JT. 2012. Classroom experiments. Available at http://serc.carleton.edu/sp/library/experiments/index.html (accessed on 20 January 2016).
Barrows HS, Tamblyn RM. 1980. Problem-based learning: an approach to medical education. New York: Springer Publishing Company, Inc.
Basili V, Selby R, Hutchens D. 1986. Experimentation in software engineering. IEEE Transactions on Software Engineering 12(7):733–743 DOI 10.1109/TSE.1986.6312975.
Basili VR, Rombach HD. 1988. The TAME project: towards improvement-oriented software environments. IEEE Transactions on Software Engineering 14(6):758–773 DOI 10.1109/32.6156.
Baxter P, Jack S. 2008. Qualitative case study methodology: study design and implementation for novice researchers. Qualitative Report 13(4):544–559.
Blank S. 2006. The four steps to the epiphany. Foster City: Cafepress.com.
Blumenfeld PC, Soloway E, Marx RW, Krajcik JS, Guzdial M, Palincsar A. 1991. Motivating project-based learning: sustaining the doing, supporting the learning. Educational Psychologist 26(3–4):369–398 DOI 10.1080/00461520.1991.9653139.
Brügge B, Krusche S, Alperowitz L. 2015. Software engineering project courses with industrial clients. Transactions on Computing Education 15(4):17:1–17:31 DOI 10.1145/2732155.
Carver J, Jaccheri L, Morasca S, Shull F. 2003. Issues in using students in empirical studies in software engineering education. In: Proceedings of the 9th international software metrics symposium. 239–249.
Carver J, Jaccheri L, Morasca S, Shull F. 2010. A checklist for integrating student empirical studies with research and teaching goals. Empirical Software Engineering 15(1):35–59 DOI 10.1007/s10664-009-9109-9.
Cochran-Smith M. 2003. Learning and unlearning: the education of teacher educators.
Teaching and Teacher Education 19(1):5–28 DOI 10.1016/S0742-051X(02)00091-4.
Creswell J. 2009. Research design: qualitative, quantitative, and mixed methods approaches. 3rd edition. Thousand Oaks: SAGE Publications, Inc.
Deiters C, Herrmann C, Hildebrandt R, Knauss E, Kuhrmann M, Rausch A, Rumpe B, Schneider K. 2011. GloSE-Lab: teaching global software engineering. In: Proceedings of the 6th IEEE international conference on global software engineering. Piscataway: IEEE.
Dewey J. 1935. How we think: a restatement of the relation of reflective thinking to the educative process. Boston: DC Heath.
Dillon J. 2008. A review of the research on practical work in school science. Technical report. King's College. Available at http://score-education.org/media/3671/review_of_research.pdf.
Easterbrook S, Singer J, Storey M-A, Damian D. 2008. Selecting empirical methods for software engineering research. In: Shull F, Singer J, Sjøberg D, eds. Guide to advanced empirical software engineering. London: Springer.
Eisenhardt KM. 1989. Building theories from case study research. The Academy of Management Review 14(4):532–550.
Fagerholm F, Guinea AS, Mäenpää H, Münch J. 2017. The RIGHT model for continuous experimentation. Journal of Systems and Software 123:292–305 DOI 10.1016/j.jss.2016.03.034.
Fagerholm F, Oza N, Münch J. 2013. A platform for teaching applied distributed software development: the ongoing journey of the Helsinki software factory. In: 3rd international workshop on collaborative teaching of globally distributed software development (CTGDSD).
Fagerholm F, Sanchez Guinea A, Mäenpää H, Münch J. 2014a. Building blocks for continuous experimentation. In: Proceedings of the 1st international workshop on rapid continuous software engineering. New York: ACM, 26–35.
Fagerholm F, Sanchez Guinea A, Münch J, Borenstein J. 2014b. The role of mentoring and project characteristics for onboarding in open source software projects. In: 8th ACM-IEEE international symposium on software engineering and measurement (ESEM).
Frank B. 1997. The impact of classroom experiments on the learning of economics: an empirical investigation. Economic Inquiry 35(4):763–769 DOI 10.1111/j.1465-7295.1997.tb01962.x.
Fucci D, Turhan B, Oivo M. 2015. On the effects of programming and testing skills on external quality and productivity in a test-driven development context. In: Proceedings of the 19th international conference on evaluation and assessment in software engineering. New York: ACM, 25:1–25:6.
Hatton N, Smith D. 1995. Reflection in teacher education: towards definition and implementation. Teaching and Teacher Education 11(1):33–49 DOI 10.1016/0742-051X(94)00012-U.
Hayes J. 2002. Energizing software engineering education through real-world projects as experimental studies. In: 15th conference on software engineering education and training (CSEET).
Hevner AR, March ST, Park J, Ram S. 2004. Design science in information systems research. MIS Quarterly 28(1):75–105.
Höst M.
2002. Introducing empirical software engineering methods in education. In: Proceedings of the 15th conference on software engineering education and training (CSEET). 170–179.
Jones JL, Jones KA. 2013. Teaching reflective practice: implementation in the teacher-education setting. The Teacher Educator 48(1):73–85 DOI 10.1080/08878730.2012.740153.
Juristo N, Gómez OS. 2012. Replication of software engineering experiments. In: Meyer B, Nordio M, eds. LASER summer school 2008–2010. Lecture Notes in Computer Science, vol. 7007. Berlin, Heidelberg: Springer, 60–88.
Juristo N, Vegas S. 2011. The role of non-exact replications in software engineering experiments. Empirical Software Engineering 16(3):295–324 DOI 10.1007/s10664-010-9141-9.
Keenan E, Steele A, Jia X. 2010. Simulating global software development in a course environment. In: International conference on global software engineering (ICGSE). Piscataway: IEEE.
Kitchenham BA, Budgen D, Brereton P. 2015. Evidence-based software engineering and systematic reviews. Boca Raton: CRC Press.
Kohavi R, Deng A, Frasca B, Longbotham R, Walker T, Xu Y. 2012. Trustworthy online controlled experiments: five puzzling outcomes explained. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, KDD '12. New York: ACM, 786–794.
Koponen IT, Mäntylä T. 2006. Generative role of experiments in physics and in teaching physics: a suggestion for epistemological reconstruction. Science & Education 15(1):31–54 DOI 10.1007/s11191-005-3199-6.
Kuhrmann M. 2012. A practical approach to align research with master's level courses. In: 15th international conference on computational science and engineering.
Kuhrmann M. 2015. Crafting a software process improvement approach—a retrospective systematization. Journal of Software: Evolution and Process 27(2):114–145 DOI 10.1002/smr.1703.
Kuhrmann M, Femmer H, Eckhardt J. 2014. Controlled experiments as means to teach soft skills in software engineering. In: Yu L, ed. Overcoming challenges in software engineering education: delivering non-technical knowledge and skills. Hershey: IGI Global.
Kuhrmann M, Fernandez DM, Knapp A. 2013. Who cares about software process modeling? A first investigation about the perceived value of process engineering and process consumption. In: 14th international conference on product focused software development and process improvement (PROFES).
Kuhrmann M, Fernández DM, Münch J. 2013. Teaching software process modeling. In: 35th international conference on software engineering (ICSE).
Kuhrmann M, Münch J. 2016a. Distributed software development with one hand tied behind the back: a course unit to experience the role of communication in GSD. In: 11th international conference on global software engineering workshops (ICGSEW).
Kuhrmann M, Münch J. 2016b. When teams go crazy: an environment to experience group dynamics in software project management courses. New York: ACM, 412–421.
Longi K, Leinonen J, Nygren H, Salmi J, Klami A, Vihavainen A. 2015. Identification of programmers from typing patterns. In: Proceedings of the 15th Koli calling conference on computing education research, Koli Calling '15. New York: ACM, 60–67.
Ludewig J, Bassler T, Deininger M, Schneider K, Schwille J. 1992. SESAM—simulating software projects. In: Proceedings of the fourth international conference on software engineering and knowledge engineering. DOI 10.1109/SEKE.1992.227898.
Münch J, Rombach D, Rus I. 2003. Creating an advanced software engineering laboratory by combining empirical studies with process simulation. In: 4th international workshop on software process simulation and modeling (ProSim).
Navarro EO, Van der Hoek A. 2007. Comprehensive evaluation of an educational software engineering simulation environment. In: 20th conference on software engineering education and training (CSEET).
Oza N, Münch J, Garbajosa J, Yague A, Ortega EG. 2013. Identifying potential risks and benefits of using cloud in distributed software development. In: 14th international conference on product-focused software development and process improvement (PROFES).
Park CL. 2004. What is the value of replicating other studies? Research Evaluation 13(3):189–195 DOI 10.3152/147154404781776400.
Parker J. Using laboratory experiments to teach introductory economics. Working paper, Reed College. Available at http://academic.reed.edu/economics/parker/ExpBook95.pdf (accessed on 23 October 2014).
Rein A-D, Münch J. 2013. Feature prioritization based on mock purchase: a mobile case study. In: Lean enterprise software and systems conference (LESS).
Richardson I, Milewski AE, Mullick N. 2006. Distributed development—an education perspective on the global studio project. In: International conference on software engineering (ICSE). New York: ACM.
Ries E. 2011. The lean startup: how today's entrepreneurs use continuous innovation to create radically successful businesses. New York: Crown Business.
Rombach D, Münch J, Ocampo A, Humphrey WS, Burton D. 2008. Teaching disciplined software development. International Journal of Systems and Software 81(5):747–763.
Runeson P. 2003. Using students as experiment subjects—an analysis on graduate and freshmen student data. In: 7th international conference on empirical assessment in software engineering (EASE).
Runeson P, Höst M. 2009. Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering 14(2):131–164 DOI 10.1007/s10664-008-9102-8.
Runeson P, Höst M, Rainer A, Regnell B. 2012. Case study research in software engineering: guidelines and examples. Hoboken: John Wiley & Sons.
Schön DA. 1983. The reflective practitioner: how professionals think in action. New York: Basic Books.
Shull F, Singer J, Sjøberg DIK. 2008. Guide to advanced empirical software engineering. London: Springer.
Stake RE. 1995. The art of case study research. Thousand Oaks: SAGE Publications, Inc.
Staron M. 2007. Using students as subjects in experiments—a quantitative analysis of the influence of experimentation on students' learning process. In: 20th conference on software engineering education and training. 221–228.
Tuckman BW. 1965. Developmental sequence in small groups. Psychological Bulletin 63(6):384–399 DOI 10.1037/h0022100.
Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A. 2012. Experimentation in software engineering. Berlin, Heidelberg: Springer.
Wood DF. 2003. Problem based learning. BMJ 326(7384):328–330 DOI 10.1136/bmj.326.7384.328.
Yin R. 2009. Case study research: design and methods. 4th edition. Thousand Oaks: SAGE Publications, Inc.