Graziotin, Daniel; Lenberg, Per; Feldt, Robert; Wagner, Stefan. Psychometrics in Behavioral Software Engineering: A Methodological Introduction with Guidelines. 2020-05-20. DOI: 10.1145/3469888

A meaningful and deep understanding of the human aspects of software engineering (SE) requires psychological constructs to be considered. Psychology theory can facilitate the systematic and sound development as well as the adoption of instruments (e.g., psychological tests, questionnaires) to assess these constructs. In particular, to ensure high quality, the psychometric properties of instruments need evaluation. In this paper, we provide an introduction to psychometric theory for the evaluation of measurement instruments for SE researchers. We present guidelines that enable using existing instruments and developing new ones adequately. We conducted a comprehensive review of the psychology literature framed by the Standards for Educational and Psychological Testing. We detail activities used when operationalizing new psychological constructs, such as item pooling, item review, pilot testing, item analysis, factor analysis, statistical properties of items, reliability, validity, and fairness in testing and test bias. We provide an openly available example of a psychometric evaluation based on our guidelines. We hope to encourage a culture change in SE research towards the adoption of established methods from psychology. To improve the quality of behavioral research in SE, studies focusing on introducing, validating, and then using psychometric instruments need to be more common.

One example is personnel selection, which has been employed across domains, including companies related to information technology [32, 150]. Another example is the assessment of job satisfaction of workers, which is an evaluative judgment one makes about one's job or job situation [144]. A company would, arguably, attempt to foster the job satisfaction of individuals, for example by employing agile methodologies instead of traditional processes [136]. Similarly, IT-related companies are interested in fostering the motivation of software engineers, which is related to factors that energize, channel, and sustain human behavior over time [46]. Stress of software developers can also be measured by psychological tests coupled with physiological measurements [101] and biometrics [39]. Finally, to bring examples related to cognition and behavior, research in software engineering has recently turned attention to understanding and addressing cognitive biases while developing software [18, 151] and to identifying and reducing gender bias in development teams [141]. Improper development, administration, and handling of psychological tests could harm a company through the hiring of an unsuitable person, and it could harm the interviewee because of missed opportunities. We believe that solid theoretical and methodological foundations should be the first step when designing a measurement instrument. The reality, however, is that not all tests are well developed in psychology [3]. Software engineering research, especially when studying psychological constructs quantitatively, is far from adopting rigorous and validated research artifacts. Already in 2007, McDonald and Edwards [90] subtitled their paper "Examining the use and abuse of personality tests in software engineering".
The authors anticipated the issue that we attempt to address in the present manuscript, that is, "the lack of progress in this [personality research in software engineering] field is due in part to the inappropriate use of psychological tests, frequently coupled with basic misunderstandings of personality theory by those who use them" (p. 67). While their focus was on personality, their concerns hold more broadly for other psychometric instruments and constructs. Instances of mishandling can, for example, be observed in the papers found by a systematic literature review of personality research in software engineering by Cruz et al. [30]. We noted in the results of Cruz et al. [30] that 48% of the personality studies in software engineering have employed the Myers-Briggs Type Indicator (MBTI) questionnaire, which has been shown to possess low to no reliability and validity properties [106], up to the point of being called "little more than an elaborate Chinese fortune cookie" [64]. Feldt and Magazinius [41] similarly pointed out deficiencies of MBTI and proposed and used an alternative (IPIP) with more empirical support in the psychological literature.

Feldt et al. [42] have argued in favor of systematic studies on human aspects of software engineering and, more specifically, in favor of adopting measurement instruments coming from psychology and related fields. Graziotin et al. [52] have echoed the call seven years later but found that research on the affect of software developers had been threatened by a misunderstanding of related constructs and how to assess them. In particular, the authors noted that peers in software engineering tend to confuse affect-related psychological constructs such as emotions and moods with related, yet different, constructs such as motivation, commitment, and well-being. Lenberg et al. [80] have conducted a systematic literature review of studies on human aspects in software development and engineering that made use of behavioral science, calling the field behavioral software engineering. Among their results, they found that software engineering research is threatened by several knowledge gaps when performing behavioral research, and that there have been very few collaborations between software engineering and behavioral science researchers. Graziotin et al. [54], meanwhile, extended their prior observations on affect to a broader view of software engineering research with a psychological perspective. The work offered what we can consider the sentiment for the present article, that is, brief guidelines to select a theoretical framework and validated measurement instruments from psychology. Graziotin et al. [54] called the field "psychoempirical software engineering" but later agreed with Lenberg et al. [80] to unify the vision under "behavioral software engineering". Hence, the present collaboration. Our previous studies have also reported that, when a validated test from psychology is adopted by software engineering researchers, its items are often modified, causing the destruction of its psychometric reliability and validity properties. Adopting an instrument properly includes a thorough evaluation of the psychometric properties of candidate instruments. Gren and Goldman [57] have argued in favor of "useful statistical methods for human factors research in software engineering" (paper title), which include underused methods such as test-retest reliability, Cronbach's α, and exploratory factor analysis, all of which are covered in this paper.
Gren [56] has also offered a psychological test theory lens for characterizing validity and reliability in behavioral software engineering research, further enforcing our view that software engineering research that investigates any psychological construct quantitatively should maintain fair psychometric properties. We agree with Gren [56] that we should "change the culture in software engineering research from seeing tool-constructing as the holy grail of research and instead value [psychometric] validation studies higher" (p. 3).

A mea culpa works better than a j'accuse in further building our case, so we bring a negative example from one of our previous studies. As reported in a very recent work by Ralph et al. [108] (which we appreciate in the next paragraph), "there is no widespread consensus about how to measure developers' productivity or the main antecedents thereof. Many researchers use simple, unvalidated productivity scales" (p. 6). In one of the earliest works by the first author of the present paper [53], we compared the affect triggered by a software development task with the self-assessed productivity of individual programmers. While we were very careful to select a validated measurement instrument of emotions and to highlight how self-assessment of productivity converges to objective assessment of productivity, we used a single Likert item scale to represent productivity. This choice was made to reduce the number of items of the measurement instrument as much as possible, because it had to be used every ten minutes, for a total of nine times for each participant. While the results of the study are not invalidated by this choice, the productivity scale itself was not validated, making the results, and thus our interpretation of them, less valuable from a psychometric perspective. The study was also (successfully) independently replicated twice, and both replications suffer from the same unfortunate choice.

We wish to refrain from being overly negative. The field of software engineering does have positive cases, excluding those from the present authors, that we can showcase here. For example, Fagerholm and Pagels [40] developed a questionnaire on lean and agile values and applied psychometric approaches to inspect the structure of value dimensions. Fagerholm [38] has also embodied psychometric approaches in their PhD dissertation by analyzing the validity of the constructs they studied. A more recent example is by Ralph et al. [108], who analyzed through a questionnaire the effects of the COVID-19 pandemic on developers' well-being and productivity. The authors constructed their measurement instrument by incorporating psychometrically validated scales on constructs such as perceived productivity, disaster preparedness, fear and resilience, ergonomics, and organizational support. Furthermore, they employed confirmatory factor analysis (which we touch upon in the present paper) to verify that the included items do indeed cluster and converge into the factors they are claimed to converge to. While positive cases do exist, we notice that they are fairly recent and that we can do better than that. We want to synthesize knowledge from psychology into software engineering research towards better quantitative studies of behavioral aspects.

Filling the knowledge gap: introduction and guidelines to psychometric evaluation for behavioral software engineering research.
Overall, we argue that one thing that is missing is an introduction to the field of psychometrics for behavioral software engineering researchers. Such an introduction can help improve the understanding of the available measurement instruments and, also, the development of new tests, allowing researchers as well as practitioners to explore the human component in the software construction process more accurately. Our overall objective is to address the lack of understanding and use of psychometrics in behavioral software engineering research, including its limitations. We also hope to increase software engineering researchers' awareness and respect of theories and tools developed in established fields of the behavioral sciences, towards stronger methodological foundations of behavioral software engineering research. With this paper, we contribute to the behavioral software engineering body of knowledge with a set of guidelines which enable a better understanding of psychological constructs in research activities when we interpret them through measurement instruments. This improvement in research quality is achieved by either (1) reusing psychometrically validated measurement instruments, as well as understanding why and how they are validated, or, if no such questionnaires exist, (2) developing new psychometrically validated questionnaires that are better suited for the software engineering domain. Our contribution is enabled by offering one theoretical deliverable and one companion, practice-oriented deliverable. (1) We offer a review and synthesis of psychometric guidelines from several textbooks, review papers, and empirical studies, packaged in a style that is familiar to software engineering researchers, including concrete examples and how to execute each activity. The guidelines enable evaluating existing measurement instruments as well as developing new ones. (2) We offer a hands-on counterpart to our review by providing a fully reproducible implementation of our guidelines as R Markdown.

The Standards for Educational and Psychological Testing (SEPT, American Educational Research Association et al. [3]) is a set of gold standards in psychological testing jointly developed by the American Psychological Association (APA), the National Council on Measurement in Education (NCME), and the American Educational Research Association (AERA). The book defines areas and standards that should be met when developing, validating, and administering psychological tests. We adopted SEPT as a framework to guide the construction of the paper, ensuring that the standards are met and that the various other references are framed in the correct context. Additionally, we organized the scoping of the paper by comparing related work from the fields of psychology research. While the present paper is not a systematic literature review or a mapping study (the discipline is so broad that entire textbooks have been written on it), we systematically framed its construction to ensure that all important topics were covered. Several authors, e.g., Crocker [28], Rust [121], Singh et al. [129], have proposed different phases for the psychometric development and evaluation of measurement instruments. Through our review, we identified 14 phases that we summarize visually in Figure 1 and outline as follows.

(1) Identification of the primary purpose for which the test scores will be employed.
(2) Identification of constructs, traits, and behaviors that are reflected by the purpose of the instrument.
(3) Development of a test specification, delineating the proportion of items that should focus on each type of construct, trait, and behavior of the test.

We focus mainly on the phases with a dark background in Figure 1, as they are the most challenging and usually not covered in software engineering research. As represented by the dashed lines, the process is linear only idealistically, when everything goes according to plan. Realistically, the construction of tests following a psychometric approach is failure-prone and iterative for correcting issues. Furthermore, not all steps have to necessarily be followed when developing a measurement instrument. We are illustrating a wide range of possibilities, some of which are often brought in by future validation studies. As a final note, the present paper, as well as any psychometric construction of measurement instruments, is not a checklist. A psychometric evaluation does not include all elements reported in this paper, as many facets of psychometrics are influenced by the research questions, study design, and data at hand. Yet, a proper psychometric evaluation requires a consideration of all elements reported in the present paper.

After a brief introduction to the key concepts of psychometrics (section 2) that are required to understand the rest of the paper, we focus on test construction in psychometrics and the phases highlighted in Figure 1, namely pool items (section 3), item review (section 4), pilot test (section 5), item analysis (section 6), factor analysis (section 7), statistical properties (section 8), reliability (section 9), validity (section 10), and fairness in testing and test bias (section 11). We close our guidelines with two opposing sources for inspiration: (1) a comprehensive list for further reading (section 12) to deepen what we are able to merely surface in this paper, and (2) a review of limitations of psychometrics and their critique (section 13). Finally, we provide a hands-on running example (Section 14) of a psychometric evaluation. We provide R code and generated datasets openly [51], following open science principles in software engineering [43].

This section provides an overview of basic terms and concepts from psychometrics that will enable an understanding of all remaining sections. In particular, we clarify psychometric models and test types (and types of testing), as these will sometimes have an influence on the statistical methods and lens to adopt when designing and evaluating a measurement instrument. The fundamental idea behind psychological testing is that what is being assessed is not a physical object, such as height or weight. Rather, we are attempting to assess a construct, that is, a hypothetical entity [3] constructed by humans to represent concepts referring to various, concrete entities that are perceived in the moment, such as behaviors, experiences, and attitudes [138]. If we assess the job satisfaction of a software developer, we are not directly measuring the satisfaction of the individual. Instead, we compare the developer's score with other developers' scores or a set of established norms for job satisfaction. When comparing the satisfaction scores between developers, we are limited to seeing how the scores differentiate between satisfied and unsatisfied developers according to the knowledge and ideas we have about satisfied and dissatisfied individuals. There are two common models of psychometrics, namely functionalist and trait [121, 148].
Functionalist psychometrics often occurs in educational and occupational tests; it deals with how the design of a test is determined by its application rather than by the constructs being measured [55, 121]. For functionalist design, a good test is one that is able to distinguish between individuals who perform well and individuals who perform less well on a job or in school activities. This is also called local criterion-based validity (explained in section 10). The functionalist paradigm can be applied to most cases where a performance assessment or an evaluation is required. Trait psychometrics attempts to address notions such as human intelligence, personality, and affect scientifically [25, 94]. The classic trait approach was based on the notion that, for example, intelligence is related to biological individual differences, and trait psychometric tests aimed to measure traits that would represent biological differences among people [94]. Both schools of thought have several aspects in common, including test construction and validation methods, and they are linked by the theory of true scores [61].

The theory of true scores, or latent trait theory, is governed by formulas of the form

X = T + E    (1)

where X is the observed score, T is the true score, and E is the error. There are three assumptions in the theory of true scores: (1) all errors are random and normally distributed, (2) true scores are uncorrelated with the errors, and (3) different measures of the observed score on the same participants are independent of each other. Besides all issues that come with the three assumptions, the theory has been criticized, the major point being that there is arguably no such thing as a true score, and that all that tests can measure are abstractions of psychological constructs [85]. Elaborations and re-interpretations of the theory of true scores have been proposed, among which is the statistical true score [15]. The statistical true score defines the true score as the score we would obtain by averaging an infinite number of measures from the same individual. With an infinite number of measures, the random errors cancel each other out, leaving us with the true score T. The statistical form of the theory of true scores should not be completely new to readers of software engineering, as most quantitative methods that are in use in our field nowadays are based on it. The statistical interpretation of the theory of true scores applies both to trait and functional psychometrics. A difference lies in generalization: functional tests can only be specific to a certain context, while trait tests attempt to generalize to an overall construct present in a group of individuals.
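As a small illustration of the statistical true score, the following R sketch (our own toy simulation, not part of the paper's running example) draws repeated error-laden measures of a fixed true score and shows that their mean approaches T as the number of measures grows:

```r
set.seed(42)

true_score <- 25                       # T: the (unknown) true score of one individual
measure    <- function(n) true_score + rnorm(n, mean = 0, sd = 4)   # X = T + E

# Averaging more and more repeated measures cancels out the random errors E
sapply(c(1, 10, 100, 10000), function(n) mean(measure(n)))
# The averages converge towards 25 as the number of measures grows
```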
Items on psychological tests can be knowledge-based or person-based [4, 121]. Knowledge-based tests assess whether an individual performs well regarding the knowledge of certain information, including possessing skills favoring performance or quality in knowledge-based tasks. For a software engineering example, debugging skills would be assessed by a knowledge-based test. Person-based tests, on the other hand, assess typical performance, or how a person is represented, with respect to a construct. Examples of constructs in person-based tests include personality, mood, and attitudes. The personalities of programmers in pair programming settings would be assessed by a person-based test. Knowledge-based tests are usually uni-dimensional, as they gravitate towards the notion of possessing or not possessing certain knowledge. We can also easily rank individuals on their scores and state who ranks better. Person-based tests are usually multi-dimensional and do not allow direct ranking of individuals without some assumptions. For example, a developer could score high on extroversion. A high score on extroversion does not make a developer with a lower extroversion score a "worse" developer in any way.

A second distinction to be made is between criterion-referenced and norm-referenced testing [49]. Criterion-referenced tests are constructed with reference to performance on a-priori defined values for establishing excellence [6, 49]. Continuing with the example on debugging skills, a criterion-referenced test would assess, with a score from 0 to 10, whether a developer is able to open a debugging tool and use its ten basic functionalities. A score of 10 out of 10 would mean that the developer is able to debug software. Norm-referenced tests lack a-priori defined scores. What constitutes a high score is in relation to how everyone else scores. A test for assessing the happiness of software developers will return scores for each participant. The test itself will have a theoretical range, say −10 for strong unhappiness and +10 for strong happiness. When a developer scores +4 on our happiness scale, all we can say is that the developer is happy rather than unhappy. If we know that software developers score −3 on average, with a standard deviation of 1, then we know that the developer is quite a happy one. The development and evolution of norm-referenced testing attempts, in addition to developing valid and reliable instruments, to establish norms, that is, values for populations and sub-populations of individuals. That is, norm-referenced tests allow us to compare scores with respect to what is considered normal [7, 49].

The first step in developing a measurement instrument is, of course, developing an initial set of items. Coaley [20] summarizes a planning and designing process in a way that reminds us of the Goal-Question-Metric (GQM) model [11]:

(1) Set clear aims: defining the purpose of the measurement instrument, which constructs we are targeting, and what the intended target population is.
(2) Define the attribute(s): moving, through an empirical lens, from the construct (the object) to its attributes being measured. In this step, it is advised to perform a comprehensive literature review of the theoretical concepts being assessed.
(3) Write a plan: drafting a specification of the measurement instrument as if it were completed already, including the test content, target group and population, kinds and number of needed participants, administration instructions, time constraints, and how scores should be interpreted.
(4) Write items: designing and constructing a pool of items related to all previously defined steps.

Sources for generating items could be experts in the domain or field, interviews with potential respondents, and prior work [110] paired with constant comparison or updates of research questions [100]. We do not want to spend too much space on writing items because good literature already exists: on the questionnaire construction process [9, 100, 110], on question effects and question-wording effects [71, 125, 128], and, for our field, on experiences and reviews of conducting surveys in software engineering research [19, 66, 74, 93, 140]. We will rather focus on selecting items from the produced pool.
While some guidelines suggest generating twice as many items as are likely required [75], we note in section 7 that factors require three to five associated items to possess meaningful variance properties. Thus, a better strategy would be to develop six to ten items per envisioned factor, or sub-construct.

When developing a new measurement instrument, we are likely (and encouraged by the previous section) to create more items than are really needed. Item review and item analysis are a series of methods to reduce the number of items of a measurement instrument and keep the best performing ones [59, 75]. This is a two-step process, as shown in Figure 2. First, it requires a review by experts; then, a pilot study and statistical calculations. During the first step (item review), experts in the domain of knowledge evaluate items one by one and argue for their presence in the test [75, 124]. During the second step (item analysis), the developers of the measurement instrument calculate item facility and item discrimination based on a pilot study that uses the tentative set of items.

During item review, experts in the domain of knowledge discuss candidate items and argue in favor of or against them, as happens when discussing the inclusion and exclusion of publications in systematic literature reviews [73]. Given its usage in the medical domain [8], we recommend using the Delphi technique [31] with domain experts to identify the best candidates after initial qualitative probes to identify sub-constructs and items. To assess the degree of agreement among raters, we recommend using inter-rater reliability measures such as Cronbach's α [29] and Krippendorff's α [34]. After reaching an agreement on the items to be included, a pilot study is required for an analysis of the items.

A pilot test is necessary to probe the effectiveness of the developed items. After item review we are left with a set of items that have been qualitatively evaluated by experts in the field. These items are too many, and we have no idea yet of how they would group together or contribute to the total variance related to the constructs we intend to represent. Hence, a pilot test. Pilot tests allow us to discover early issues related to questionnaire items, for example wording problems, lack of clarity, confusing steps, or even discrimination of and between respondents [110]. There are no clear guidelines on how big the sample size for pilot tests should be. The first criterion is rather the representativeness of the pilot sample with respect to the target group or population [132]. Johanson and Brooks [67] have identified suggestions from the literature that see n = 30 participants as the lower limit to gather meaningful data and to reason about the test construction. Johanson and Brooks [67] have also conducted a cost-benefit analysis to identify the point at which adding participants to a pilot sample would yield notably smaller gains in estimating population parameters. They analyzed sample size changes for a range of Pearson's correlations, a range of proportions, and confidence intervals for reliability coefficients. All results confirm the indicated value of n = 30 as a reasonable minimum recommendation for a pilot study for a psychological test.
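Looking back at the item review step above, here is a minimal sketch of how the experts' agreement could be quantified in R. It assumes the irr package and uses hypothetical ratings (three experts judging eight candidate items on a 1-5 ordinal scale); it is an illustration, not the procedure of our running example:

```r
# install.packages("irr")   # if needed
library(irr)

# Hypothetical item review data: rows are expert raters, columns are candidate items,
# values are 1-5 ordinal judgments of how well each item reflects the construct
ratings <- rbind(
  expert1 = c(5, 4, 2, 5, 3, 1, 4, 4),
  expert2 = c(4, 4, 2, 5, 3, 2, 5, 4),
  expert3 = c(5, 3, 1, 4, 3, 2, 4, 5)
)

# Krippendorff's alpha for ordinal data; values close to 1 indicate strong agreement
kripp.alpha(ratings, method = "ordinal")
```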
A further possible novelty that we would like to report to the software engineering research audience is brought by Collins's [22] overview of cognitive methods to pretest survey instruments. The author argues, in their review of the literature, that the tradition of survey pre-testing has mostly dealt with quantitative issues, e.g., standardizing data collection procedures such as question wording, and worrying about clustering and variance. All this, while valuable, should be coupled with techniques that go beyond the implicit assumption that all respondents understand the questions being asked in the same way and are willing to answer the same set of questions in the same way. Many of these techniques are about qualitative data rather than quantitative data. Cognitive methods enable exploring the process by which respondents think and act when responding (or not) to questions, and which factors influence these answers. Cognitive methods include think-aloud and probing in cognitive interviewing, paraphrasing, card sorts, vignettes, confidence ratings, and response latency timing. In the absence of consensus on the best process [63], we will briefly summarize cognitive interviewing because it is an established method [35, 92, 146] and it brings qualitative methods to psychometrics.

Cognitive interviewing is a qualitative method that studies the question-response process of participants when answering questions on psychological tests [92]. The aim is to use theories from the cognitive sciences to understand how participants perceive and interpret items, and to identify issues that may arise when distributing the instrument [35]. Cognitive interviewing requires a small sample of participants purposely selected for in-depth interviews on how and why they answered the questions as they did [92]. The cognitive interview process is characterized by two verbal report methods: think-aloud and verbal probing [146]. In think-aloud, the participant is asked to (preferably verbally) explicate what they are thinking from the point of reading an item to the point of scoring the item. In verbal probing, the interviewer intervenes with specific questions and probes [22]. The former is respondent-driven, the latter is interviewer-driven. Examples of think-aloud questions are of an open-ended nature and invite more thoughts, e.g.: "What are you thinking while reading this item?", "What makes you think that?", or "I noticed you hesitated for a while there, what made you hesitate?". Examples of verbal probing (from Collins [22], Table 2) include comprehension ("What does the term X mean to you?"), retrieval ("How did you calculate your answer?"), confidence judgment ("How well do you remember this?"), and response ("How did you feel about answering this question?").

While qualitative methods are useful for identifying issues, we can apply quantitative techniques to reduce items. Item analysis refers to several statistical methods for the selection of items of a psychological test [59, 75]. Item analysis is often the only method for item reduction when a single construct is to be evaluated, whereas multiple constructs and sub-constructs require further steps to be taken. Two of the best-known techniques are item facility and item discrimination. This section will explain both techniques according to the test type.

6.0.1 Item facility. Item facility for an item is a measure of the tendency to answer an item with the same score. This has different meanings according to the type of test.

Knowledge-based tests.
Item facility for an item, also known as item difficulty, is defined as the ratio of the number of participants who provided a right answer to the number of all participants in a test [75, 121]. The value of item facility ranges from 0 (all respondents are wrong) to 1 (all respondents are right). In other words, item facility is the probability of obtaining the right answer for the item [75]. Of interest for test construction is the variance of an item. An item variance is the calculated variance of a set of item scores, which, for a knowledge-based item, is a set of zeros and ones. That is, the item variance for an item with facility 0 is 0 (all are wrong), and the item variance for an item with facility 1 is also 0 (all are right). Both these extremes would render the item rather void, as all individuals scoring the same on an item would not tell us anything interesting about the individuals. What usually happens is that some individuals will get the answer right and some will get the answer wrong. For such cases, we can compute the variance for an item using the formula

V_i = f_i (1 − f_i)    (2)

where i is the item and f_i is the item facility for i. The highest possible value for V_i is 0.25, and this is the case when items are neither very easy nor very difficult. When V_i has small values, for example 0.047, it means that most respondents tend to reply the same way for that item, making it either extremely easy or extremely difficult. A value of V_i near 0.00 does not warrant automatic exclusion of item i, but the value should solicit a review.

Person-based tests. Facility suggests that participants in all tests can be either right or wrong on an answer. What about trait measurement, where participants are not exactly right or wrong but are assessed in terms of a psychological construct? We can calculate item facility for these cases as well. The issue lies only in the naming, because item facility was developed for knowledge-based tests first. Some scholars prefer to use the term item endorsement or item location [112] to better reflect how calculations can be done on traits. For trait measurement, it is common to have Likert items [84]. Item facility for Likert items can be calculated with the mean value of the item. If a Likert item maps to the values 0 (strongly disagree) to 5 (strongly agree), the extreme values for the item will be 0 and 5 instead of 0 and 1. An item with an average value of 4.8 and a variance of 0.09 is a candidate for deletion, whereas an item with an average value of 2.72 and a variance of 3.02 is deemed interesting, because we prefer items that reflect the diversity of participants rather than having most participants score the same. Items such as those worded negatively should be reversed in their scoring prior to item analysis, so that all items have comparable values.
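As a small illustration (our own sketch with simulated data, not the paper's running example), item facility and item variance can be computed in R for both dichotomous and Likert items:

```r
set.seed(1)

# Dichotomous (knowledge-based) items: 1 = right, 0 = wrong; rows are 50 participants
knowledge <- sapply(c(0.9, 0.5, 0.1), function(p) rbinom(50, size = 1, prob = p))
colnames(knowledge) <- c("easy", "medium", "hard")

facility <- colMeans(knowledge)         # proportion of correct answers per item
variance <- facility * (1 - facility)   # formula (2); peaks at 0.25 for facility 0.5
rbind(facility, variance)

# Likert (person-based) items scored 0-5: facility is the item mean
likert <- data.frame(
  q1 = sample(4:5, 50, replace = TRUE),   # nearly everyone agrees: low variance
  q2 = sample(0:5, 50, replace = TRUE)    # spread-out answers: higher variance
)
rbind(mean = colMeans(likert), variance = apply(likert, 2, var))
```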
6.0.2 Item discrimination. Item discrimination is a technique to discover items that behave oddly with respect to how we expect participants to score them. The meaning of item discrimination differs according to the type of test.

Knowledge-based tests. Item discrimination reflects items that behave oddly, in the sense that individuals who tend to score very high on the test as a whole tend to be wrong on the item, while individuals who tend to score very low tend to be right on it [89]. Such an item possesses a negative discrimination. Ideally, a test should have no items with zero or negative discrimination [75]. From a statistical point of view, if an item is uncorrelated with the overall test score, then it is almost certainly uncorrelated with the other items and makes very little contribution to the overall variance of the test [121, 129]. Therefore, we calculate item discrimination by computing the correlation coefficient between an item score and the overall test score. If the computed correlation coefficient is 0 or below, we should consider removing the item.

Person-based tests. What holds for knowledge-based tests holds to a wide extent for person-based tests. Instead of assessing how well an item behaves with respect to the test score, we instead assess how well an item measures the overall trait in question. By calculating the correlation coefficient between an item and the overall test score for a specific trait, we will have an initial estimate of how well the item represents the trait in question. If the computed correlation coefficient is 0 or below, we should consider removing the item.

The case of norm-referenced tests. The variance of an item, here calculated using the classic definition from statistics, is interesting also in the context of norm-referenced testing. Item facility also applies to norm-referenced testing, as the purpose of the test is to spread out individuals' scores as much as possible on a continuum. A larger spread is due to a larger variance, and we are interested in including items that make a contribution to the variance [75, 121]. Furthermore, if an item has a high correlation with other items and has a large variance, it follows that the item makes a high contribution to the total variance of the test, and it will be kept in the pool of items.

The case of criterion-referenced tests. Item analysis is often seen as applicable to norm-referenced testing exclusively [121]. With criterion-referenced testing, it is still possible to calculate item facility and item discrimination, and these can be computed, for example, before and after teaching and formative activities (which might include workshops on, say, Scrum methods at IT companies). A difference in item facility before and after the teaching activities would indicate that the item is a valid measure of the skill taught. This would turn the measure of item facility into a measure of item discrimination as well.

6.0.3 Limitations of item analysis. Item analysis, while valuable and still in use today, is part of classical test theory (CTT), which assumes that an individual's observed score is the same as a true score plus an error score [135] (see formula 1). Modern replacements for CTT have been proposed, and the most prominent one is item response theory (IRT) [36]. IRT models build upon a function (called item response function, IRF, or item characteristic curve, ICC) that defines the probability of being right or wrong on an item [2]. IRT is outside the scope of the present paper, as CTT is still in use to this day [105] and explaining IRT requires a publication of its own. Item analysis, as presented in this section, assumes that there is a single test score, meaning that a single construct is being measured. Whenever multiple constructs or a construct with multiple factors are being measured, item analysis needs to be accompanied by factor analysis [129].
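Before moving to factor analysis, a minimal sketch (again on simulated data, with hypothetical item names) of the item-total correlations described above:

```r
set.seed(2)

# Simulated person-based scale: five Likert items driven by one latent trait
trait <- rnorm(200)
items <- as.data.frame(sapply(1:5, function(j)
  pmin(5, pmax(0, round(2.5 + 0.8 * trait + rnorm(200, sd = 0.9))))))
names(items) <- paste0("q", 1:5)
items$q5 <- sample(0:5, 200, replace = TRUE)   # q5 ignores the trait: a poorly behaving item

# Corrected item-total correlation: each item vs. the sum of the remaining items
discrimination <- sapply(names(items), function(j)
  cor(items[[j]], rowSums(items[, setdiff(names(items), j)])))
round(discrimination, 2)   # q5 should correlate near zero and is a removal candidate

# psych::alpha(items) reports the same idea in its r.drop column
```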
Factor analysis is one of the most widely employed psychometric tools [75, 77, 129], and it can be applied to any dataset where the number of participants is higher than the number of item scores under observation. Factor analysis is for understanding which test items "go together" to form factors in the data that ideally should correspond to the constructs that we are aiming to assess [121]. At the same time, factor analysis allows us to reduce the dimensionality of the problem space (i.e., reducing factors and/or associated items) and to explain the variance in the observed variables in terms of underlying latent factors [77]. In case we intend to assess a single construct, factor analysis helps in identifying those items that (best) represent the construct we are interested in, so that we can exclude the other items. Factor analysis techniques are based on the notion that the constructs we observe through our measurement instruments can be reduced to fewer latent variables which are unobservable but share a common variance [152] (see Section 2).

Factor analysis starts with computed correlation coefficients as its first building block. A way to summarize correlation coefficients is through a correlation matrix, which is a matrix of items that displays the correlation coefficient between all items of the tentative test. Table 1 provides an example correlation matrix for five items. Given that two items i and j correlate with each other in the same way no matter their order (r(i, j) = r(j, i)), and that a single item correlates with itself with a perfect correlation coefficient (r(i, i) = 1.00), the matrix displays only its lower triangle, omitting the repeated upper part, with the value of 1.00 on its diagonal [120, 152]. A high correlation between certain pairs of items in Table 1 indicates that these items might belong to the same factor. This approach, however, lacks part of the story. Questions such as "how do our candidate factors explain the total variance of the measurement instrument?", "to which candidate factor does an item belong most?", and "how are factors related to each other?" are better answered by further analysis. There are two main factor analysis techniques, namely Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA) [77, 112, 129]. EFA attempts to uncover patterns and clusters of items by exploring predictions, while CFA attempts to confirm hypotheses on existing factors [152].

Exploratory factor analysis (EFA) is a family of analysis techniques aimed at reducing the number of items by retaining the items that are most relevant to certain factors [77]. Strictly speaking, when developing a measurement instrument, after item analysis, it is desirable to observe whether the measurements for the items tend to cluster. These clusters are likely to represent different factors that might or might not pertain to the construct being measured [75, 121]. EFA provides tools to group and select items from a correlation matrix. For a measure X_1, EFA operates on the equation

X_1 = λ_11 F_1 + λ_12 F_2 + ... + λ_1k F_k + U_1 + E_1    (3)

where the F are the factors grouping the items being analyzed, U_1 is a factor that is unique to the measure, the λ are the loadings of the item on the respective factors, and E_1 is the random measurement error [129, 152]. Factor loadings are, in practice, weights that provide us with an idea of how much an item contributes to a factor [152]. From the equation we derive that the variance of the constructs being measured is explained by three parts: (1) the common factors, also known as the communality of a variable [37, 129, 152], (2) the influence of factors that are unique to that measure, and (3) random error; that is, variance = h² + s² + e², where h² denotes the communality, s² the specific variance, and e² the error variance. Estimates of the communality of an item are often referred to as h².
h² is the calculated proportion of variance that is free of error and is thus shared with other variables in a correlation matrix [37, 129, 152]. Several techniques calculate the communality of a variable by summing the squared factor loadings associated with that variable. The estimate of the unique variance, denoted as u², is the proportion of variance that is not associated with the communality, that is, u² = 1 − h² [37, 129, 152]. Determining a value of u² for an item allows us to find how much specific variance can be attributed to that variable. Lastly, the random error that is associated with an item is the last component of the total variance. Random error is also often called the unreliability of variance [37, 152]. Unique factors are never correlated with common factors, but common factors may or may not be correlated with each other [152].

EFA encompasses three phases [77, 112, 121, 129], described in Figure 3. First, we have to select a fitting procedure to estimate the factor loadings and unique variances of the model. Then, we need to define and extract a number of factors. Finally, we need to rotate the factors to be able to properly interpret the produced factor matrices. Many statistical programs allow us either to perform all these phases separately or to perform more than one at the same time. It is not an easy task to assign a methodology to one of the three categories below. The reader is advised that some textbooks avoid our classification of phases and simply resort to a more practical set of questions, e.g., "how to calculate factors" and "how many factors should we retain". We also note that recent studies have formulated Bayesian versions of these classical exploratory factor analysis techniques and claimed several benefits [23].

7.1.1 Factor loading. The most common technique for estimating the loadings and variance is to use the standard statistical technique of principal component analysis (PCA) [102]. PCA assumes that the communalities for the measures are equal to 1.0. That is, all the variance for a measure is explained only by the factors that are derived by the methodology, and hence there is no error variance. PCA operates on the correlation matrix, mostly on its eigenvalues, to extract factors that correlate with each other. The eigenvalue of a factor represents the variance of the variables accounted for by that factor. The lower the eigenvalue, the less the factor contributes to explaining the variance in the variables [97]. Factor weights are computed to extract the maximum possible variance, with successive factoring continuing until there is no further meaningful variance left. PCA is not a factor analysis method stricto sensu, as factor analysis assumes the presence of error variance rather than being able to explain all variance. Some therefore prefer to state that its output should be referred to as a series of components rather than factors. While more simplistic than other techniques to estimate factor loadings, performing a PCA is still encouraged as a first step in EFA before performing the actual factor analysis [120, 121, 152].

Among the proper factor analysis techniques that exist, we are interested in a widely recommended technique for estimating loadings and variance named principal axis factoring (PAF) [75, 120, 145]. PAF does not operate under the assumption that the communalities are equal to 1.0, so the diagonal of the correlation matrix (e.g., the one in Table 1) is substituted with estimates of the communalities, h². PAF estimates the communalities using different techniques (e.g., the squared multiple correlation between a measure and the rest of the measures) and a covariance matrix of the items. Factors are estimated one at a time until a large enough share of the variance in the correlation matrix is accounted for. Under PAF, the ordering of the factors determines their importance in terms of fitting, e.g., the first factor accounts for as much variance as possible, followed by the second factor, and so on. Russell [120] provides a detailed description of the underlying statistical operations of PAF, which we omit for the sake of brevity. Most statistical software provides functions to implement PAF.
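In R, for instance, the psych package offers both approaches; the sketch below uses simulated data and hypothetical item names, so it illustrates the calls rather than reproducing our running example:

```r
# install.packages("psych")   # if needed
library(psych)

set.seed(3)
trait <- rnorm(300)
items <- as.data.frame(sapply(1:5, function(j) 0.7 * trait + rnorm(300, sd = 0.7)))
names(items) <- paste0("q", 1:5)

round(cor(items), 2)   # the correlation matrix, factor analysis' first building block

pca <- principal(items, nfactors = 1, rotate = "none")      # PCA: communalities fixed at 1.0
paf <- fa(items, nfactors = 1, fm = "pa", rotate = "none")  # principal axis factoring

pca$loadings
cbind(h2 = paf$communality, u2 = paf$uniquenesses)          # communalities and uniquenesses
```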
Both PCA and PAF result in values assigned to candidate factors. Therefore, there has to be a strategy to extract meaningful factors. The bad news here is that there is no unique way, let alone a single proper way, to extract factors [114, 120, 129]. More than one strategy should be adopted in parallel to allow a comparison of results, ending with a sense-making analysis to reach an ultimate decision [37]. Several factor extraction techniques exist, which are mentioned in the cited references in the present section. We provide here those that are used most widely as well as those that are easier to apply and understand. Our running example (Section 14) enables an easier visualization of the concepts below.

Perhaps the simplest strategy to extract factors is Kaiser's eigenvalue-greater-than-one (K1) rule [70]. The rule simply states that factors with an eigenvalue higher than 1.0 should be retained. Kaiser's rule is quite easy to apply but it is highly controversial [26, 37, 114, 120]. First, the rule was originally designed for PCA and not for PAF or other factor analysis methods, which might make it unsuitable for methodologies that provide estimates of communalities as the diagonal of the correlation matrix [26]. Second, the cut-off value of 1.0 arbitrarily discriminates between factors that are just above and just below it [26]. Third, computer simulations found that K1 tends to overestimate the number of factors [37]. Yet, K1 is still the default option for some statistical software suites, making it an unfortunate de-facto main method for factor extraction [26].

Cattell's scree test [17], also based on eigenvalues, involves plotting the eigenvalues extracted from either the correlation matrix or the reduced correlation matrix (thus making it suitable for both PCA and PAF) against the factors they are associated with, in descending order. One then inspects the curved line for a break in the values (or an elbow), up to the point where a substantial drop in the eigenvalues cannot be observed anymore. The break is a point at which the shape of the curve becomes horizontal. The strategy is then to keep all factors before the breaking point. The three major criticisms of this approach are that it is subjective [26], that more than one scree might exist [134], and that data often does not offer a discernible scree, so a conceptual analysis of the candidate factors is always required [121].

Revelle and Rocklin [117] proposed the Very Simple Structure (VSS) method for factor extraction, which is based on assessing how well the original correlation matrix can be reproduced by a simplified pattern matrix, for which only the highest loading of each item is retained (everything else is set to zero) [26].
The VSS criterion to assess how well the pattern matrix performs is a number from 0 to 1, making it a goodness-of-fit measure almost of a confirmatory nature rather than an exploratory one [116]. The VSS criterion is gathered for solutions involving a number of factors that goes from 1 to a user-specified value. The strategy ends with selecting the number of factors that provides the highest VSS criterion.

Finally, the method of parallel analysis (PA), introduced by Horn [65], was found to be very robust for factor extraction [26, 37]. PA starts with the K1 concept that only factors with an eigenvalue larger than 1.0 should be kept. Horn [65] has argued that the K1 rule was developed with population statistics and was thus unsuitable when sampling data. Sampling errors would then cause some components from uncorrelated variables to have eigenvalues higher than one in a sample [26]. PA takes into account the proportion of variance that results from sampling rather than from having access to the population. The way it achieves this is through a constant comparison of the solution with randomly generated data [116]. PA generates a large number of matrices from random data in parallel with the real data. Factors are retained as long as their eigenvalues are greater than the mean eigenvalue generated from the random matrices.

The last step is to rotate the factors in the dimensional space to improve our interpretation of the results [33, 75, 77]. An unrotated output, that is, the one that often results after factor extraction, maximizes the variance accounted for by the first factor, followed by the second factor, the third factor, and so on. That is, most items would load on the first factors, and many of them would load on more than one factor in a substantial way. Rotating factors builds on the concept that there are a number of "factor solutions" that are mathematically equivalent to the solution found after factor extraction. By performing a rotation of the factors, we retain our solution but allow an easier interpretation. We rotate the factors to seek a simple structure. A simple structure is a loadings pattern such that each item loads strongly on one factor only and weakly on the other factors. If the reader is interested in the mathematical foundations of factor rotation, two deep overviews are offered by Browne [10] and Darton [33].

There are two families of rotations, namely orthogonal and oblique [120]. Orthogonal rotations force the assumption of independence between the factors, whereas oblique rotations allow the factors to correlate with each other. Which methodology to use is influenced by the statistics software; for example, the R psych package [115] provides "varimax", "quartimax", "bentlerT", "equamax", "varimin", "geominT", and "bifactor" for orthogonal rotations and "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", "biquartimin", and "cluster" for oblique rotations. Several rotation methodologies are summarized by Abdi [1], Browne [10], and Russell [120]. Perhaps the best known and most employed [33] orthogonal rotation method is the Varimax rotation [69]. Varimax maximizes the variance (hence the name) of the squared loadings of a factor on all variables. Each factor will tend to have either large or small loadings on any particular variable. While this solution makes it rather easy to identify factors and their loadings on items, the independence condition of orthogonal rotation techniques is hard to achieve.
The assumption of independence of factors, especially in the context of behavioral research, belittles the value of orthogonal rotation techniques, to the point that "we see little justification for using orthogonal rotation as a general approach to achieving solutions with simple structure" [37, p. 283]. Oblique rotation is preferred for behavioral software engineering studies, as it is sensible to assume that behavioral, cognitive, and affective factors are separated by soft walls of independence (e.g., motivation and job satisfaction) [37, 120, 121]. If anything, one would first conduct an investigation using oblique rotation and observe whether the solution shows little to no correlation between factors and, in that case, switch to orthogonal rotation [37]. The two most employed and recommended oblique rotation techniques are Direct Oblimin (and its slight variation Direct Quartimin) and Promax, both of which perform well [37]. Fabrigar et al. [37] and Russell [120] recommend using a Promax rotation because it provides the best of both approaches. A Promax rotation first performs an orthogonal rotation (a Varimax rotation) to maximize the variance of the loadings of a factor on all the variables [120]. Then, Promax relaxes the constraint that the factors are independent of each other, turning the rotation into an oblique one. The advantage of this technique is that it will reveal whether factors really are uncorrelated with each other [120].

There have been several recommendations regarding the required sample size, number of measures per factor, number of factors to retain, and interpretation of loadings [120, 129, 152]. The recommended overall sample size, as reported by Yong and Pearce [152], is at least 300 participants, with at least 5 to 10 observations for each variable that is subject to factor analysis. This recommendation has, however, low empirical validation. As reported by Russell [120], a Monte Carlo study by MacCallum et al. [88] analyzed how different sample sizes and communalities of the variables were able to reproduce the population factor loadings. They found that with item communalities higher than or equal to 0.60, results were very consistent with sample sizes as low as 60 cases. Communality levels around 0.50 required samples of 100 to 200 cases. In his review, Russell [120] also found that 39% of EFA studies involved samples of 100 or fewer cases. On the number of measures (items) per factor, Yong and Pearce [152] report that for something to be labeled as a factor it should have at least 3 variables, especially when factors receive a rotation treatment, in which case only a high correlation with each other (coefficient higher than 0.70) and little correlation with other items would make them worthy of consideration. Generally speaking, the correlation coefficient for an item to belong to a factor should be 0.30 or higher [131]. Russell [120] notes that prior work has recommended at least three items per factor; however, four or more items per factor were found to be a better holistic way to ensure an adequate identification of the factors. In his review, he identified 25% of studies with three or fewer measures per factor. We reported on the number of factors to retain in section 7.1.2, so we will not repeat ourselves here. There is no single recommended number, and one would follow (possibly more than) one extraction method to identify the best number of factors according to the case.
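To make the extraction and rotation steps concrete, here is a sketch of how they might look in R with the psych package (simulated data and hypothetical item names; the exact choices depend on the study at hand):

```r
library(psych)

set.seed(4)
n <- 300
f1 <- rnorm(n); f2 <- rnorm(n)
f2 <- 0.4 * f1 + sqrt(1 - 0.4^2) * f2            # two moderately correlated latent constructs
make_item <- function(f) 0.7 * f + rnorm(n, sd = 0.7)
items <- data.frame(a1 = make_item(f1), a2 = make_item(f1), a3 = make_item(f1),
                    b1 = make_item(f2), b2 = make_item(f2), b3 = make_item(f2))

# How many factors? Compare several extraction criteria
fa.parallel(items, fm = "pa", fa = "fa")   # Horn's parallel analysis (includes a scree plot)
vss(items, n = 3)                          # Very Simple Structure criterion

# Principal axis factoring with an oblique (Promax) rotation
efa <- fa(items, nfactors = 2, fm = "pa", rotate = "promax")
print(efa$loadings, cutoff = 0.30)         # simple structure: each item loads on one factor
efa$Phi                                    # factor inter-correlations (oblique solutions only)
```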
Tabachnick et al. [131] add that cases with missing values should be deleted to prevent overestimation of factors. Russell [120] wrote something that is worth mentioning for the uninitiated behavioral software engineering researcher: even when constructing a new measurement instrument, there is already an expectation of possible factors in the mind of the researcher. The reason is that items are developed following an investigation of prior work and/or empirical data (see section 6). That number is a good starting point when conducting EFA. Yong and Pearce [152] offer some further explanation on the interpretation of loadings produced by statistical software. There should be few item crossloadings (i.e., split loadings, when an item loads at 0.32 or higher on two or more factors), so that each factor defines a distinct cluster of interrelated variables. There are exceptions to this that require an analysis of the items. Sometimes it is useful to retain an item that crossloads, under the assumption that it is the latent nature of the variable. Furthermore, Tabachnick et al. [131] report that, with an alpha level of 0.01, a rotated factor loading with a meaningful sample size would need to have a value of at least 0.32, as this would correspond to approximately 10% of overlapping variance.

As a final note, the reader might now be left wondering: why conduct item analysis to reduce items if factor analysis is available? Kline [75] argues that the many phases of factor analysis involve so many assumptions about data distribution and decisions on techniques that much could go wrong with it. Furthermore, items selected by item analysis are highly correlated with items selected by factor analysis [99], making item analysis a cheaper, effective initial method for item reduction.

Contrary to exploratory factor analysis, confirmatory factor analysis (CFA) is for confirming a-priori hypotheses and assessing how well a hypothesized factor structure fits the obtained data [120]. A hypothesized factor structure could be derived from existing literature as well as from the data of a previous study that explored the factor structure. Once the data is obtained to compare to the hypothesized factor structure, a goodness-of-fit test should be conducted. CFA requires statistical modeling that is outside the scope of this paper, and the estimation of goodness-of-fit in CFA is a long-lasting debate, as "there are perhaps as many fit statistics as there are psychometricians" [113, p. 31]. Russell [120], Rust [121], and Singh et al. [129] provide several techniques for estimating the goodness-of-fit in CFA, e.g., the Chi-squared test, the root mean square residual (RMSR), the root mean square error of approximation (RMSEA), and the standardized RMSR. Statistical software implements these techniques, including the R psych package [116]. A widely employed technique for CFA is found in structural equation modeling (SEM), which is a family of models to fit networks of constructs [72]. MacCallum and Austin [87] provided a comprehensive review of SEM techniques in the psychological sciences, including their applications and pitfalls.

Conducting both EFA and CFA is very expensive. When designing and validating a measurement instrument, and when obtaining a large enough sample of participants, it is common to split the sample, conducting EFA on one part of it and CFA on the remaining part [129]. Most authors, however, prefer conducting EFA only [120] and rely on future independent studies towards a better psychometric evaluation of a tool. This is also why statistical tools, e.g., the R psych package [115], provide estimates of fit for EFA as well as convenience tools to adapt the data to CFA packages, e.g., R sem [45]. We refer the reader to a prior work of ours in the behavioral software engineering domain [81] where we conducted a CFA and describe its application. We also note that, like for EFA, Bayesian methods for CFA have also been proposed [5, 78].
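As an illustration of what such a confirmatory step can look like, the sketch below uses the lavaan package (one common R choice; the paper's own pointers are to psych and sem [45]) with simulated data and hypothetical construct and item names:

```r
# install.packages("lavaan")   # if needed
library(lavaan)

# Simulate responses driven by two correlated latent constructs (toy data)
set.seed(5)
n <- 300
sat <- rnorm(n); mot <- 0.5 * sat + sqrt(0.75) * rnorm(n)
survey <- data.frame(
  js1 = 0.8 * sat + rnorm(n, sd = 0.6), js2 = 0.7 * sat + rnorm(n, sd = 0.7),
  js3 = 0.6 * sat + rnorm(n, sd = 0.8),
  mo1 = 0.8 * mot + rnorm(n, sd = 0.6), mo2 = 0.7 * mot + rnorm(n, sd = 0.7),
  mo3 = 0.6 * mot + rnorm(n, sd = 0.8)
)

# Hypothesized factor structure: three items per construct
model <- '
  satisfaction =~ js1 + js2 + js3
  motivation   =~ mo1 + mo2 + mo3
'
fit <- cfa(model, data = survey)
fitMeasures(fit, c("chisq", "df", "cfi", "rmsea", "srmr"))   # goodness-of-fit indices
summary(fit, standardized = TRUE)
```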
Most authors, however, prefer conducting EFA only [120] and rely on future independent studies for a more complete psychometric evaluation of an instrument. This is also why statistical tools, e.g., the R psych package [115], provide fit estimates for EFA as well as convenience tools to adapt the data to CFA packages, e.g., the R sem package [45]. We refer the reader to a prior work of ours in the behavioral software engineering domain [81], where we conducted a CFA and describe its application. We also note that, as for EFA, Bayesian methods have also been proposed for CFA [5, 78]. Assessing characteristics and performance of individuals poses several challenges when interpreting the resulting scores. One of them is that a raw score is not meaningful without understanding the test standardization characteristics [75]. For example, a score of 38 on a debugging performance test is meaningless without knowing that 38 corresponds to being able to do little more than open a debugger. Furthermore, the interpretation of the results varies widely depending on whether developers score, on average, 400 or 42 on the test. The former issue is related to criterion-referenced standardization, the latter to norm-referenced standardization [75, 121]. Criterion-referenced tests assess what an individual with a given score is expected to be able to do or know. Norm-referenced standardization enables comparing an individual's score to the ordered ranking of a population (also see section 2). We concentrate on norm-referenced standardization, as criterion-referenced standardization is unique to a test's criteria. A first step to norm-reference a test is to order the results of all participants and rank an individual's score. Measures such as the median and percentiles are useful for achieving the ranking and the comparison. When we can treat our data as interval scales and the data approximately follows a normal distribution, we can also use the mean and the standard deviation. The standard deviation is useful for telling us how much an individual's score is above or below the mean score. Instead of reporting that an individual's score is, e.g., 13 above the mean score, it is more interesting to know that the score is 1.7 standard deviations above the mean score. Hence, we norm-standardize scores using different approaches. The remainder of this section is modeled after Rust's [121] text and augmented by further explanations from other sources. Whenever a sample approximates a normal distribution, we know that a score above the average is in the upper 50% and, by following the three-sigma empirical rule [107], we know that approximately 68% and 95% of scores lie within one and two standard deviations of the mean, respectively. For expressing an individual's score in terms of how distant it is from the mean score, we transform the value to its Z score (also called standard score) using the formula in Equation 4: $z = (x - \bar{x}) / s$, where $x$ is a participant's score, $\bar{x}$ is the mean of all participants' scores, and $s$ is the standard deviation of all participants' scores. The ideal case would be to use the population mean and standard deviation. In software engineering research we lack studies estimating population characteristics (an example of norm studies was provided by [50]), so we should either aggregate the results of several studies or gather more samples. An important note is that transforming scores into Z scores does not make the scores normally distributed. This would require a normalization procedure, explained below. Z scores typically range between -3.00 and +3.00.
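As a minimal R sketch of this transformation, assuming a hypothetical vector of raw test scores:

scores <- c(38, 45, 51, 40, 62, 47)          # hypothetical raw scores
z <- (scores - mean(scores)) / sd(scores)    # Equation 4
z.alt <- as.numeric(scale(scores))           # scale() yields the same standardization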
This range is not always suitable in practice. A software developer could, for example, object to a Z score of -0.89, which, at first glance, might appear low (value) or negative (sign). A T score, not to be confused with the t-statistic of Student's t-test, is a Z score that is scaled and shifted so that it has a mean of 50 and a standard deviation of 10. T scores thus typically range between 20 and 80. For transforming a Z score into a T score, we use the formula in Equation 5: $T = 50 + 10z$. The software developer in the previous example would have a T score of 41.1 from a Z score of -0.89. Stanine and sten scores respond to the need of transforming a score to a scale from 1 to 9 (stanine) or from 1 to 10 (sten), with a mean of 5 (stanine) or 5.5 (sten) and a standard deviation of 2. These scores purposely lose precision by discarding decimal values and keeping integers only. The conversion to stanine and sten scores follows the rules in Table 2. The advantage of stanine and sten scores lies in their imprecision. If our low-performing developer with a Z score of -0.89 was compared with two other developers having Z scores of -0.72 and -0.94, how meaningful would such a tiny difference in scores be? Their stanine scores are 3, 4, and 4, respectively. Their sten score would be 4. Stanine and sten scores provide clear cut-off points for easier comparisons. There is an important difference between stanine and sten scores besides their range. A stanine score of 5 represents an average score in a sample. An average sten score cannot be obtained, because the value of 5.5 does not belong to its possible values. A sten score of 5 represents the low average band (which ranges from 4.5 to 5.5, that is, up to half a standard deviation below the mean), and a sten score of 6 represents the high average band (which ranges from 5.5 to 6.5, that is, up to half a standard deviation above the mean). The standardization techniques that we presented in the previous section carry the assumption that the sample and population approximate the normal distribution. For all other cases, it is possible to normalize the data. Examples include algebraic transformations, e.g., square-root or log transformations, as well as graphical transformations. See introductory statistical texts for more detailed explanations of and a broad set of such transformations.
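Before moving on to reliability, a minimal R sketch of the T and sten conversions discussed above; the sten band boundaries below assume the conventional half-standard-deviation cut-offs (compare Table 2):

z <- c(-0.89, -0.72, -0.94)                   # the three developers from the example above
t.score <- 50 + 10 * z                        # Equation 5: 41.1, 42.8, 40.6
sten <- pmin(pmax(floor(2 * z) + 6, 1), 10)   # assumed half-SD bands; gives 4, 4, 4 here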
Reliability can be seen either in terms of precision, that is, the consistency of test scores across replications of the testing procedure (reliability/precision), or as a coefficient, that is, the correlation between scores on two equivalent forms of the test (reliability/coefficients) [3]. For evaluating the precision of a measurement instrument, it would be ideal to have as many independent replications of the testing procedure as possible on the same very large sample. Scores are expected to generalize over alternative test forms, contexts, raters, and over time. The reliability/precision of a measurement instrument is then assessed through the range of differences of the obtained scores. The reliability/precision of an instrument should be assessed with as many sub-groups of a population as possible. The reliability/coefficients of a measurement instrument, which we will simply call reliability from this point on, is the most common way to refer to the reliability of a test [3]. There are three categories of reliability coefficients, namely alternate-form (derived by administering alternative forms of the test), test-retest (the same test at different times), and internal-consistency (relationships among scores derived from individual test items during a single session). We adhere to the classification of reliability by Nunnally [98] and Rust [121] and provide a brief overview of reliability facets in psychometric theory in Figure 4. Several factors, as defined in the SEPT [3], affect the reliability of a measurement instrument, especially adding or removing items, changing the wording or intervals of items, causing variations in the constructs to be measured (e.g., using a measurement instrument for happiness to assess job satisfaction of developers), and ignoring variations in the intended populations for which the test was originally developed. We now introduce the most widely employed techniques for establishing the reliability of a test. Test-retest reliability, also known as test stability, is assessed by administering the measurement instrument twice to the same sample within a short interval of time. The paired set of scores for each participant is then compared with a correlation coefficient such as the Pearson product-moment correlation coefficient or Spearman's rank-order correlation. A correlation coefficient of 1.00, while rare, would indicate perfect test-retest reliability, whereas a correlation coefficient of 0.00 would indicate no test-retest reliability at all. A negative coefficient is not good news either, and it is conventionally treated as a value of 0.00. Test-retest reliability is not suitable for certain tests, such as those assessing knowledge or performance in general. Participants either face a learning or motivation effect from the first test session or simply improve (or worsen) their skills between sessions. For such cases, the parallel forms method is more suitable. The technique requires a systematic development of two versions of the same measurement instrument, namely two parallel tests, that assess the same construct but use different wording or items. Parallel tests for assessing debugging skills would feature the same sections and number of items, e.g., arithmetic, logic, and syntax errors. The two tests would need different source code snippets that are, however, very similar. A trivial example would be to test for unwanted assignments inside conditions in different places and with different syntax (e.g., using if (n = foo()) in version one and if (x = y + 2) in version two). As with test-retest reliability, each participant faces both tests and a correlation coefficient can be computed. Split-half reliability is a widely adopted and more convenient alternative to parallel forms reliability. Under this technique, a measurement instrument is split into two half-size versions. The split should be as random as possible, e.g., splitting by taking odd- and even-numbered items. Participants face both halves of the test and, again, a correlation coefficient can be computed. The obtained coefficient, however, is not a measure of reliability yet. The reliability of the whole measurement instrument is computed with the Spearman-Brown formula in Equation 6: $r_{\text{full}} = \frac{2 r_{\text{half}}}{1 + r_{\text{half}}}$, where $r_{\text{half}}$ is the correlation of the split tests. This formula shows that the more discriminating items a test has, the higher its reliability will be.
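A minimal R sketch of these coefficients, assuming hypothetical objects time1 and time2 holding the scores of the same participants at two administrations and items holding a participant-by-item response matrix:

# Test-retest reliability.
cor(time1, time2, method = "pearson")
# Split-half reliability with the Spearman-Brown correction (Equation 6).
odd  <- rowSums(items[, seq(1, ncol(items), by = 2)])   # odd-numbered items
even <- rowSums(items[, seq(2, ncol(items), by = 2)])   # even-numbered items
r.half <- cor(odd, even)
r.full <- (2 * r.half) / (1 + r.half)
# Internal-consistency estimates (e.g., Cronbach's alpha) for the same item matrix.
library(psych)
alpha(items)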
Inter-rater reliability is perhaps the most common form of reliability found in software engineering studies. Qualitative studies or systematic literature reviews and mapping studies often have different raters evaluating the same items. The sets of ratings can be assessed using a correlation coefficient. Cohen's kappa is widely used in the literature as an inter-rater coefficient for two raters, together with Fleiss' kappa for the inter-rater coefficient of more than two raters. Cases have also been made for using Krippendorff's alpha [34]. The standard error of measurement is used for generating confidence intervals around reported scores. It is strictly related to the reliability coefficient [121], as shown in Equation 7: $\text{SEM} = \sqrt{\sigma^2 (1 - r)}$, where $\sigma^2$ is the variance of the test scores and $r$ is the calculated reliability coefficient of the test. The standard error of measurement also provides an idea of how errors are distributed around observed scores. The standard error of measurement is maximized, and becomes equal to the standard deviation of the observed scores, when a test is completely unreliable. The standard error of measurement is minimized to zero when a test is perfectly reliable. If the assumption that errors are normally distributed is met, one can calculate the 95% confidence interval by using the z curve value of 1.96 to construct the interval $x \pm 1.96 \cdot \text{SEM}$ around an observed score $x$. Confidence intervals can also be used to compare participants' scores. Should one participant's score fall below or above the interval, their results would differ significantly from the norm. Validity in psychometrics is defined as "The degree to which evidence and theory support the interpretation of test scores for proposed uses of tests." [3]. Psychometric validity is therefore a different (but related) concept than the study validity that software engineering researchers are used to dealing with [41, 104, 127, 147]. Validation in psychometric research is related to the interpretation of the test scores. For validating a test, we should gather relevant evidence to provide a sound scientific basis for the interpretation of the proposed scores. Kline [75] and Rust [121] have summarized six major facets of validity in the context of psychometric tests, which we summarize and represent in Figure 5 and describe below, augmented with references to material that offers additional explanations. Face validity concerns how the items of a measurement instrument are accepted by respondents. For example, software developers expect the wording of certain items to be targeted at them instead of, say, children. Similarly, if a test presents itself to be about a certain construct, such as debugging expertise, it could cause face validity issues if it contained a personality assessment. Content validity (sometimes called criterion validity or domain-referenced validity) concerns the extent to which a measurement instrument reflects the purpose for which the instrument is being developed. If a test was developed under the specifications of job satisfaction but measured developers' motivation instead, it would present issues of content validity. Content validity is evaluated qualitatively most of the time, because the form of deviation matters more than the degree of deviation, but there are proposals for its quantitative estimation [149]. Predictive validity is a statistical validity defined as the correlation between the score of a measurement instrument and a score of the degree of success in the selected field. For example, the degree of success in debugging performance is expected to be higher with higher programming experience.
Computing a score for predictive validity is as simple as calculating a correlation value (such as Pearson's or Spearman's). According to the acceptance criterion for predictive validity, a correlation higher than 0.5 could be considered acceptable predictive validity for the items. We would then feel justified in including programming experience as an item to represent the construct of debugging performance capability. Concurrent validity is a statistical validity that is defined as the correlation of a new measurement instrument with existing measurement instruments for the same construct. A measurement instrument tailored to the personality of software developers ought to correlate with existing personality measurement instruments. While concurrent validity is a common measure of test validity in psychology, it is a weak criterion, as the old measurement instrument itself could have low validity. Nevertheless, concurrent validity is important for detecting low validity issues in measurement instruments. Construct validity is a major validity criterion in psychometric tests. As constructs are not directly measurable, we observe the relationship between the test and the phenomena that the test attempts to represent. For example, a test that identifies highly communicative team members should have a high correlation with observations of highly communicative people who have been labelled as such. The nature of construct validity is that it is cumulative over the number of available studies [123]. Differential validity assesses how a measurement instrument correlates with measures from which it should differ, and how it correlates with measures from which it should not differ. In particular, Campbell and Fiske [13] have differentiated between two aspects of differential validity, namely convergent and discriminant validity. Rust [121] mentions a straightforward example of both. A test of mathematical reasoning should correlate positively with a test of numerical reasoning (convergent validity). However, the mathematics test should not strongly correlate with a test of reading comprehension, because the two constructs are supposed to be different (discriminant validity). In case of low discriminant validity, there should be an investigation of whether the correlation is the result of a wider underlying trait, say, general intelligence. Differential validity is overall empirically demonstrated by a discrepancy between convergent validity and discriminant validity.
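A minimal R sketch of these correlational validity checks, where all vectors are hypothetical scores of the same participants:

cor(new.scale, programming.experience)   # predictive validity against an external criterion
cor(new.scale, established.scale)        # concurrent validity against an established instrument
cor(new.scale, related.construct)        # convergent validity: expected to be substantial
cor(new.scale, unrelated.construct)      # discriminant validity: expected to be close to zero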
Fairness is "the quality of treating people equally or in a way that is right or reasonable" (Cambridge [12], online). A test is fair when it reflects the same constructs for all participants and its scores have the same meaning for all individuals of the population [3]. A fair test neither advantages nor disadvantages any participant through characteristics that are not related to the constructs under observation. From a participant's point of view, an unfair test leads to a wrong decision based on the test results. An example of a test that requires fairness is an attitude or skills assessment when interviewing candidates for hire in an information technology company. The SEPT [3] reports on several facets of fairness. Individuals should have the opportunity to maximize how they perform with respect to the constructs being assessed. Similarly, for a measurement instrument that assesses traits of participants, the test should maximize its ability to capture the extent to which the constructs being measured are present in individuals. This fairness comes from how the test is administered, which should be as standardized as possible. Research articles should describe the environment of the experimental settings, how the participants were instructed, which time limits were given, and so on. Fairness also comes, on the other hand, from the participants themselves. Participants should be able to access the constructs being measured without being advantaged or disadvantaged by individual characteristics. This is an issue of accessibility to a test and is also part of limiting item, test, and measurement bias. We provide an overview of bias in psychometric theory in Figure 6. Rust [121] provides an overview of item, test, and measurement bias, which we supplement with related work. It almost feels unnecessary to state that a measurement instrument should be free from bias regarding age, sex, gender, and race. These cases are indeed covered by legislation to ensure fairness. In general, there are three forms of bias in tests, namely item bias, intrinsic test bias, and extrinsic test bias [121]. Item bias, also known as differential item functioning, refers to bias arising from individual items of the measurement instrument. A straightforward example would be to test a (non-UK) European developer on coding snippets dealing with imperial system units. A more common item bias concerns the wording of items. Even among native speakers, the use of idioms such as double negatives can cause confusion. Asking a developer to mark a coding snippet that is free from logic and syntax errors is clearer than asking them to mark code that does not possess neither logic nor syntax errors. A systematic identification of item bias that goes beyond carefully checking an instrument is to carry out an item analysis with all possible groups of potential participants, for example men and women, or speakers of English at different proficiency levels. A comparison of the facility values (the proportion of correct answers) of each item can reveal potential item bias. For instruments that assess traits and characteristics of a group instead of functions or skills, a strategy is to follow a checklist of questions that researchers and pilot participants can answer [60]. Differential item functioning (DIF) is a statistical characteristic of an item that shows potential unfairness of the item among different groups that should otherwise obtain the same test results [103]. The presence of DIF does not necessarily indicate bias but rather unexpected behavior of an item [3]. This is why, after the detection of DIF, it is important to review the root causes of the differences. Whenever DIF happens for many items of a test, a test construct or final score is potentially unfair among different groups that should otherwise obtain the same test results. This situation is called differential test functioning (DTF) [118]. There are three main techniques for identifying DIF, namely the Mantel-Haenszel approach, item response theory (IRT) methods, and logistic regression [153].
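As a minimal sketch of the logistic regression approach for a single dichotomous item, assuming hypothetical vectors item (0/1 correctness), total (total test score, used as the matching criterion), and group (a factor identifying the groups under comparison):

m0 <- glm(item ~ total,         family = binomial)
m1 <- glm(item ~ total + group, family = binomial)   # uniform DIF
m2 <- glm(item ~ total * group, family = binomial)   # non-uniform DIF
anova(m0, m1, m2, test = "LRT")   # significant improvements flag potential DIF for review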
Intrinsic test bias occurs when there are differences in the mean scores of two groups that are due to characteristics of the test itself rather than differences between the groups in the constructs being measured. Measurement invariance is the desired property; intrinsic test bias occurs when it is lacking. If a test for assessing knowledge of software quality is developed in English and then administered to individuals who are not fluent in English, the measure of the construct of software quality knowledge would be contaminated by a measure of English proficiency. Differential content validity (see section 10.2) is the most severe form of intrinsic test bias, as it causes lower test scores in different groups. If a measurement instrument for debugging skills has been designed with American software testers in mind, any participant who is not an American software tester will likely perform worse on the test, to varying degrees. Rust [121] reports on various statistical models proposed over the last 50 years to detect intrinsic test bias which, however, present various issues, including the introduction of more unfairness near cut-off points or for certain groups of individuals. There is no recommended way to detect intrinsic test bias other than performing item bias analysis paired with sensitivity analysis. Extrinsic test bias occurs whenever an unfair decision is made based on a non-biased test. These issues usually pertain to tests about demographics dealing with social, economic, and political issues, so they are unlikely to concern measurement instruments developed for the software engineering domain. The present paper only scratches the surface of psychometric theory and practice, and its aim is to be broad rather than deep. We collect, in this section, what we consider to be good next steps for a better understanding and expansion of the concepts that we have presented. The books written by Coaley [20], Kline [75], Nunnally [98], and Rust [121] provide an overall overview of psychometric theory and cover all topics mentioned in the present paper, and more. In particular, we invite the reader to compare how they present measurement theory and their views and classifications of validity and reliability. A natural follow-up is the SEPT [3], which proposes standards that should be met in psychological testing. While our summary breaks down fundamental concepts and presents them in an introductory way for researchers of behavioral software engineering, our writing cannot do justice to the guidelines and recommendations for factor analysis offered by Fabrigar et al. [37], Russell [120], Singh et al. [129], and Yong and Pearce [152]. To those we add the work of Zumbo [153], who explored, through data simulations, the conditions that yield reliable exploratory factor analysis with samples below 50, which is unfortunately a condition we often live with in software engineering research. Furthermore, we wish to point the reader to alternatives to factor analysis, especially for confirmatory factor analysis (CFA). Flora and Curran [44] analyzed the benefits of using Robust Weighted Least Squares (Robust WLS) regression. With a Monte Carlo simulation, they have shown that robust WLS provides accurate test statistics, parameter estimates, and standard errors even when the assumptions of CFA are not met. Bayesian alternatives for CFA have been proposed as early as the 1980s [78] and were later expanded to cover the exploratory phase as well; see, for example, the works by Conti et al. [24], Lu et al. [86], and Muthén and Asparouhov [95]. In the above sections we have pointed to several papers that can provide a modern, Bayesian statistical view of many psychometric analysis procedures. We also note that a more general treatment and overview can be found in Levy and Mislevy [82].
While it is important for a software engineering researcher who wants to use and develop psychometric instruments to know the key concepts and techniques of the more classical, typically frequentist, psychometric tradition, one can then switch to a Bayesian view for either philosophical or practical reasons (a simpler, more unified treatment, for one). Within the software engineering domain, Gren [56] has offered an alternative lens on validity and reliability of software engineering studies, also based on psychology, that we advise reading. Ralph and Tempero [109] have offered a deep overview of construct validity in software engineering through a psychological lens. We do want to note one aspect of the method we have used that can be seen as a limitation: there is much current discussion about the statistical methods that are and/or should be applied in behavioral and social science, including psychology [122, 139], as well as in applied sciences in general [142]. This has also affected software engineering and, for example, a recent paper argued for transitioning to Bayesian statistical analysis in empirical software engineering [47]. However, it is too early to base guidelines on proposals in this ongoing scholarly discussion, since there is not yet a clear consensus. Thus, since we base our review on the current and more established literature, it is likely that future work will need to consider more powerful and up-to-date statistical methods for the creation and assessment of psychometric instruments. We therefore foresee future updates to this paper that extend it with such more recent analysis methods. Our article has so far shown appreciation of psychometrics and of how it could be adopted in software engineering research. We should balance this with reasons for not adopting psychometrics, which we also recommend as further reading. 4

4 Following principles of open science, we deposited revisions of the present article on arXiv as we wrote it, submitted it for peer review, and revised it during peer review. One of the good aspects of depositing manuscripts on arXiv is that it attracts feedback. A previous version of the present paper, deposited on arXiv with identifier 2005.09959v2, was used as a basis by Lewis [83] to offer a critique of our stance on psychometrics as well as a broader warning against relying on psychometrics in absolute terms. The present paper takes Lewis's feedback into account, especially regarding the wording and the limitations of psychometrics. We recommend reading Lewis's paper, as it offers a view on psychometrics that, on several points, is opposed to ours.

Psychometrics is not universally accepted as a perfect tool for assessing psychological constructs. Fronts of discontent with psychometrics have emerged, especially in recent times and in the medical, social, and education fields [96, 124, 138]. The broadest critique of psychometric-based testing is that it might not be the best tool for assessing individuals. On the technical side, there has been some evidence that alternative evaluation systems, which matched individuals to a standardized set of holistic and realistic vignettes, improved discrimination of individual performance and facilitated the identification of severe medical issues in some participants [111]. The technical side of the critique is, however, limited, as psychometric theory has a long history of robust statistical methods.
What causes the widest concern with psychometrics is the central argument, summarized by Schoenherr and Hamstra [124], that following a psychometric approach might lead to focusing too narrowly on characteristics of individuals in terms of dimensions, features, and competences, while at the same time missing out on context, the uniqueness of individuals, and the team perspective. In other words, psychometric-based assessments potentially neglect a richness of information, much of which comes from direct experience with individuals and, in particular, from qualitative data [124]. This missed information may be essential for decisions such as those related to the allocation of resources, promotions, and the development or prioritization of skills. Uher [137] has argued that discussions for and against psychometrics have so far revolved around the "qualitative vs. quantitative data" debate, which links to the issue of clashing worldviews in research. Lewis [83] has expressed skepticism that psychometric methods can be applied in software engineering in a successful way, one of the reasons being that an "empirical study that is purely empirical cannot succeed." (p. 40). This pushback can go to an opposite extreme, with Michell [91] defining psychometrics as pathological science because of a lack of testing of the hypothesis that some psychological attributes are quantitative at all, a hypothesis that psychometric science accepts into its core. We hope for the reader's mercy if we do not cover these topics in a concluding section 5, especially because we do not see any winning idea regarding epistemological stances. In short, as concisely summarized by Schoenherr and Hamstra [124], adopting psychometric approaches can be met by a clash between postpositivism and constructivism, or, in other words, the debate concerning the adequacy of mapping constructs to numbers and the assumptions that such a process implies. We agree with Lewis [83] that the idea of psychometrics, as well as the one of assigning numbers to people, is seductive. We do not see this issue as confined to psychometrics, though. Our stance is that, should one decide to embark on a fully empirical and quantitative approach to assess psychological constructs in software engineering, the psychometric approach is an effective tool. There is little debate on the effectiveness of psychometric tests as tools. Schoenherr and Hamstra [124], indeed, argue that "we should not seek to establish a post-psychometric era" (p. 720) and that we should rather focus on the lack of understanding of the theoretical context in which measurement instruments have been developed as well as on the several, sometimes unresolved, debates within the psychometrics discourse. Our paper is a start in this direction. On the issue of assigning numbers to individuals, we agree that quantitative studies are not all there is. We want to emphasize the importance of qualitative studies in the same way as psychology is rediscovering them. Thurstone [133], in his plea to render psychological science quantitative, mathematical, and robust, was clear in specifying that "the mathematical and rigorous treatment of a psychological problem is in no sense a substitute for the other forms of exploratory and descriptive types of experimentation" (p. 228). This quote should remind us that quantitative methods, while important, are but one side of a multifaceted coin that we should complement with a mixed-method mindset that includes qualitative studies and other disciplines.
As anticipated in the opening of this paper, we are interested in qualitative research as well. We have offered our proposals for qualitative behavioral software engineering [79]. In Lenberg et al. [79], we suggest that future research in software engineering could benefit from a broader set of methods from qualitative psychology, such as interpretive phenomenological analysis, narrative analysis, and discourse analysis. We invite researchers conducting qualitative studies of software engineering to emphasize reflexivity on how our thinking came to be and how pre-existing understanding is constantly revised in the light of new insights. Finally, we encourage the adoption of qualitative guidelines and criteria to enhance the quality of qualitative studies. To sum up, we see discontent with psychometrics arising from clashing worldviews, ontological and epistemological issues, and the way psychometric tests are used in decision-making processes as razor-blade cut-off methods for making complex decisions. There are also limitations of psychometric approaches that are related to practical challenges. The field is enormous (even our introductory paper has passed 150 referenced sources), with many theoretical and practical contributions that require a wide understanding of the concepts at hand, several of which are taken as implicit. Uher [138], indeed, argues that many issues with psychometrics arise from psychological jargon and underlying, codified conceptual fallacies that bring misconceptions. We believe that basic introductory contributions such as ours help in coping with this issue, and we welcome more. Uher [138] provides a comprehensive introduction to terms such as psyche, behavior, constructs, operationalization, variables, attributes, and more, which we recommend reading. Furthermore, we highlight the considerable amount of effort it takes to develop a psychometrically validated test. Gren [56] has argued that "Spending an entire Ph.D. candidacy on the validation of one single measurement of a construct should be, not only approved, but encouraged." We agree that enabling a Ph.D. student to develop a measurement instrument for software engineering constructs following a psychometric approach should be allowed, but we want to emphasize here the "entire Ph.D." part. It might truly take years to develop and validate a measurement instrument. We highly recommend searching the literature for existing, psychometrically validated tests before developing a new one. All in all, we emphasize our view that we should be systematic and rigorous regardless of worldview, ontological and epistemological stance, and preferred methods. Psychometrics is one of several approaches to quantitative research, which we embrace but do not blindly believe in. Qualitative research is as important as quantitative research, and the two complement each other. Qualitative research is essential to discover rich data that captures the uniqueness of individuals as such, rather than pooling people into data points. Psychometric methods are an excellent tool to enhance the way we select, develop, understand, and test metrics. Psychometrics should, however, not be a justification to build a metric for the sake of it. Nor does a proper application of psychometrics guarantee the validity of results in absolute terms. We recall that validity in psychometrics is defined as "The degree to which evidence and theory support the interpretation of test scores for proposed uses of tests." [3], for which we now highlight the word interpretation.
We believe that a methodology description is best complemented by a concrete example of its application. In Appendix A, we thus provide a complete scenario of the development of a fictitious measurement instrument and the establishment of its psychometric properties with the R programming language. The evaluation follows the same structure as the present paper for ease of understanding. In the spirit of open science in software engineering [43], we also provide the running example as a replication package [51]. We wrote the example using R Markdown, making it fully repeatable, and the package includes the generated dataset and instructions for replication with newly generated data. The adoption and development of validated and reliable measurement instruments in software engineering research, whenever humans are to be evaluated, should benefit from psychology and statistics theory; we need not and should not 'reinvent the wheel'. This paper provides a brief introduction to the evaluation of psychometric tests. Our guidelines will contribute to a better development of new tests as well as to a justified decision-making process when selecting existing tests. After providing basic building blocks and concepts of psychometric theory, we introduced item pooling, item review, pilot testing, item analysis, factor analysis, statistical properties of items, reliability, validity, and fairness in testing and test bias. In an appendix, we also provided a running example of an implementation of a psychometric evaluation and shared both its data and source code (scripts) openly to promote self-study and a basis for further exploration. We followed textbooks, method papers, and society standards to ensure coverage of all important steps, but we could only offer a brief introduction and invite the reader to explore our referenced material further. Each of these steps is a universe of its own, with dozens of published artifacts related to it. A proper psychometric evaluation requires a consideration of all elements reported in the present paper. The development of a measurement instrument, however, does not have to execute every single step we summarize, and not necessarily in the same order. We are illustrating a wide range of possibilities, some of which are often carried out in subsequent validation studies. Critical thinking is required throughout the whole process, which involves choices and trade-offs. Adding the steps described in this paper will increase the time required for developing measurement instruments. However, the return on investment will be considerable. Psychometric analysis and refinement of measurement instruments can improve their reliability and validity. The software engineering community must value psychometric studies more. This, however, requires a cultural change that we hope to champion with this paper.

The present document provides an executable, hands-on, introductory psychometric validation example written for a behavioral software engineering audience. The present document is part of a paper. Even though this document is as self-contained as possible, we recommend reading it only after reading the paper. The present document is also an R Markdown file. Its text version is interactive and can be executed directly in RStudio, making it completely reproducible. The present document contains an example, with simulated data, of the psychometric validation phases of a norm-based measurement instrument for assessing an overall construct with five sub-constructs.
Our fictitious construct is the "individual perception styles of source code" that, through a literature review (or, perhaps, after a grounded theory study), we believe is mainly composed of, or highly related to, the following five constructs: code curiosity, programming paradigm flexibility, learning disposition, collaboration propensity, and comfort in novelty. 1 We develop a measurement instrument with 31 items, all represented by Likert items ranging from 1 to 5, which is self-assessed by the individual software engineer. The Likert items would ideally form five Likert scales, which would actually emerge after an exploratory factor analysis. However, when we develop scales we have a rough idea of the related constructs anyway. We represent here what we expect the exploratory factor analysis to show, with all items grouped by what we might see as potential factors.
• F1 "Code curiosity": 7 items, F1_1 to F1_7
• F2 "Programming paradigm flexibility": 4 items, F2_8 to F2_11
• F3 "Learning disposition": 9 items, F3_12 to F3_20
• F4 "Collaboration propensity": 4 items, F4_21 to F4_24
• F5 "Comfort in novelty": 7 items, F5_25 to F5_31
What the items actually are is not of interest for the purposes of the present document. Our team proceeds to psychometrically evaluate the measurement instrument: first, to reduce the number of items; then, to validate the items belonging to the factors and, possibly, to reduce the number of items again; finally, to offer a reasoning on statistical properties, reliability, and validity. The following are requirements that are typically found in a basic psychometric validation, with the exclusion of fabricatr, which is for simulating data and which we use only for developing this example. We offer a dataset with pre-populated data that allows repetition of the entire document. For discarding our provided data and simulating new data (which should conform, to a fair extent, to what we expect in the various sections), set the safeguard variable to FALSE or delete the graziotin_et_al-bse_psychometrics_example.csv file. The new dataset will behave very similarly to the provided one; after all, we used the very same code to generate it. This option will allow reproducibility only. For full repeatability, but mostly for better clarity, we recommend starting with our provided dataset. The reason is that the rest of this tutorial provides reasoning around certain values for items, which will slightly change in case a new dataset is generated. The following are options that are useful for repetition and replication of our example. This section explains how we constructed the simulated dataset graziotin_et_al-bse_psychometrics_example.csv. This section can be safely skipped as it does not pertain to the psychometric evaluation. We provide the data simulation part for full reproducibility.
Likert.brakes <- c(-Inf, -1.5, -0.5, 0.5, 1.5, Inf)
Likert.values <- c("1", "2", "3", "4", "5")
counter <- 0
var.names