An Evaluation of the Upward Mobility Assessment Center for the Bureau of Engraving and Printing

United States Civil Service Commission
Bureau of Policies and Standards
Technical Memorandum 76-6

AN EVALUATION OF THE UPWARD MOBILITY ASSESSMENT CENTER FOR THE BUREAU OF ENGRAVING AND PRINTING

Hardy L. Hall
Applied Psychology Section
Personnel Research and Development Center
U. S. Civil Service Commission
Washington, D. C.
July 1976

ABSTRACT

As one method for selecting lower-graded employees for upward mobility positions, the Bureau of Engraving and Printing used an assessment center developed by the Personnel Research and Development Center of the U. S. Civil Service Commission. The assessment center was found to have very high interrater reliabilities between assessors (median = .94) and was better able to differentiate between candidates than were supervisor or self-ratings. Assessment center ratings were less subject to leniency errors and had no significant relationship to candidate sex, age, or experience. Supervisor ratings of some skills were found to be related to age and experience. The assessment center supplied information about a candidate's ability that was not provided by either the supervisor or self-ratings.

ACKNOWLEDGMENT

The author wishes to thank the following individuals of the PRDC staff for their helpful comments and suggestions: Dr. Richard D. Neidig, Dr. Leon I. Wetrogan, Dr. Charles G. Martin, Dr. Kenneth R. Brown and Cynthia Clark. The assistance of Ruth Christensen in the typing of the report, Lillian Frazier and Maria Williams in the coding of the data, and Ken Kalscheur and L. James Spaulding of BEP in the collection of data is greatly appreciated.

CONTENTS

List of Skills
The Assessment Center
Other Candidate Appraisal Methods
The Selection Process
Interrater Reliability
The Relative Importance of the Skill Ratings in Determining the Overall Assessment Center Rating
The Ability to Differentiate Between Candidates
Objectivity in the Assessment Center Process
Relationships Between Different Methods for the Same Skills
Summary and Conclusion
References

Appendices:
A. Individual Exercise Observation Form
B. Group Discussion Observation Form
C. Final Assessment Report
D. Supervisory Appraisal

Tables:
1. Interrater Reliability Estimates
2. Stepwise Multiple Regression Analysis, Prediction of the Overall Assessment Center Rating
3. Comparisons of Candidates in the Upper 25% and the Lower 75% on Skill Ratings
4. Sign Tests for Means and Standard Deviations
5. Comparisons of Candidates in the Upper 25% and the Lower 75% on Selected Biographical Variables
6. Correlations Between Selected Biographical Variables and Skill Ratings
7. Relationships Between Different Methods for the Same Skills

AN EVALUATION OF THE UPWARD MOBILITY ASSESSMENT CENTER FOR THE BUREAU OF ENGRAVING AND PRINTING

The Personnel Research and Development Center of the U. S. Civil Service Commission was contacted by representatives of the Bureau of Engraving and Printing (BEP), Department of the Treasury, who requested assistance in the development and operation of an assessment center for their agency's Upward Mobility Program.
BEP wanted to use a technique which would provide a job-related, objective approach for evaluating candidates, beyond available supervisor ratings, unassembled examinations, or other commonly used measures. They were particularly concerned with the lack of differentiation afforded by supervisor appraisals in the past. Furthermore, since candidates were to be selected for jobs for which they had not had a chance to demonstrate the necessary abilities, they looked to the assessment center as an excellent opportunity to observe a candidate's performance in situations which were related to actual target positions. The seven target positions that were identified for the program were: production controller, accounting technician, voucher examiner, supply clerk, physical science technician, management assistant, and engineering draftsman.

An earlier publication by the Personnel Research and Development Center (Hall & Baker, 1975) dealt more specifically with the operational aspects of the assessment center. This report deals primarily with the evaluation of specific measurement components of the assessment center. Its purpose is to answer the following questions identified by the investigator as important in evaluating the effectiveness of the assessment center:

1. Did the assessors in each team demonstrate acceptable levels of interrater reliability?
2. Did each skill rating contribute to the overall assessment center rating?
3. Did the assessment center differentiate between candidates better than supervisor or self-ratings?
4. Was the assessment center objective--not significantly affected by variables such as age and sex?
5. Did the assessment center supply information about candidates which supervisor or self-ratings did not?

List of Skills

A job analysis of the target positions revealed that the following skills were essential for success across all seven jobs:

1. Ability to identify and assimilate relevant data/factors in job-related situations.
2. Ability to look at all possible courses of action and make appropriate decisions.
3. Ability to solve job-related mathematical problems accurately.
4. Ability to get along with people and work effectively with them.
5. Ability to express ideas clearly, logically and in the correct grammatical form.
6. Ability to adjust to changes in varying work situations.
7. Ability to be a self-starter who follows through on work assignments.
8. Ability to consistently produce quality work and to be predictable in varying work situations.
9. Is punctual and regularly stays at the job site except during periods of excused absence or leave.

While these skills were found to be common to all the target positions, some were more important than others depending upon the particular job; e.g., the ability to solve job-related mathematical problems accurately was more important for the position of accounting technician than for management assistant.

The Assessment Center

Exercises

This model consisted of three exercises which required approximately four hours of the candidate's time. The assessment center provided an opportunity for a team of two assessors to observe the candidates in simulated situations, with emphasis on observing behaviors related to the essential skills. This model was designed to measure Skills 1, 2, 3, 4, 5, and 6. Skills 7, 8, and 9 could not be measured in the assessment center. The following exercises were used:

Analysis Problem. The candidate was given an agency problem which he/she was to work on individually.
This problem took approximately one hour and required the candidate to analyze and organize available information in order to make recommendations for handling the problem. Skills 1, 2, and 3 were measured in this exercise.

Group Discussion Exercise. A group of 5-6 candidates convened and engaged in a group discussion, each participant having an assigned role. The topic of the group discussion was related to the work of a Federal agency and required a solution. This exercise required approximately one hour and measured Skills 1, 4, 5, and 6.

Individual Presentation Exercise. The candidate was given two office situations involving interpersonal problems and was asked to respond to each of the situations. This was an individual exercise which required approximately 20 minutes and measured Skills 2, 4, and 5.

Drafting Exercise. This was an optional exercise included for candidates who were interested in the engineering drafting series. The candidate had seven drawings to complete, and the exercise was objectively scored.

Assessor Training

Eleven employees, currently enrolled in a management development program, were selected from the Bureau of Engraving and Printing workforce to serve as assessors and administrators for the assessment center operation. These assessors and administrators were representative of the work areas from which the target jobs had been selected. The assessor training consisted of two days of on-site training at the agency. The training program was conducted under the direction of Dale Baker and Hardy Hall of the U. S. Civil Service Commission. The training program required the assessors to participate in each of the exercises, included a discussion of the range of behaviors which were observable in each exercise, and required the assessors to observe and rate a group of mock candidates. Reliability checks were made to determine if the assessors were rating candidates on the same basis.

The Assessment Center Rating Procedure

A total of 82 candidates, ranging in grade from GS (General Schedule) 1 to 7, WG (Wage Grade) 2 to 5, and WP (Printing and Lithographic) 4 to 9, were evaluated by eight assessors over a period of five days. None of the assessors had any prior knowledge of the candidates they assessed. Normally, 15-18 candidates per day were evaluated by three teams of two assessors each. Teams 1 and 2 assessed every day, while the membership of Team 3 changed every other day. In other words, two assessors from Team 3 (Team 3A) evaluated candidates on the first, third, and fifth days, and two different assessors (Team 3B) worked the second and fourth days. This was done to provide a degree of flexibility in scheduling assessors. Each team of assessors assessed from four to six candidates a day.

In the Group Discussion Exercise, each assessor observed half of the candidates. In the event that there were five candidates in the discussion, one assessor observed three candidates while the other observed two. For the Analysis Problem, each assessor independently evaluated each candidate's written recommendation and his/her answers to a number of math items relating to the problem. These math items were objectively scorable, and ratings ranged from zero to seven. In the Individual Presentation Exercise, both assessors observed one candidate at a time and recorded their observations independently. The Drafting Exercise, developed and administered by BEP staff members, was scored by the supervisor of the engineering drafting section. The range of scores was from zero to seven.
Only candidates who had previously expressed an interest in the drafting position were administered this exercise. (No attempt was made to evaluate this exercise, as too few subjects participated.)

For reliability purposes, skills were measured at least twice where possible. For example, Skill 1 was measured in both the Analysis Problem and the Group Discussion Exercise. Assessors were trained to note only observable behaviors and not to make inferences. Assessors recorded their initial observations of each candidate's performance on the Individual Exercise Observation Form (see Appendix A) for the Analysis Problem and the Individual Presentation Exercise, and on the Group Discussion Observation Form (see Appendix B) for the Group Discussion Exercise. In those instances where one assessor did not directly observe a candidate, i.e., in the Group Discussion Exercise, he/she used the other assessor's observation form. Each behavior was assigned a plus (+) or a minus (-) depending upon whether it was effective or ineffective.

After this phase was completed, a rating from one to seven was made independently on each skill by each assessor. This step was taken in order to obtain a measure of interrater reliability. A rating of "one" was "extremely weak", a "four" was "satisfactory", and a "seven" was "outstanding". After all six skills were rated independently by each assessor, the assessors in each team discussed their ratings and the behaviors involved in each, and then reached a consensus rating on each skill. If a consensus could not be reached, split ratings, e.g., 4/5, were recorded; otherwise, ratings were given in integers only. After all skill ratings were discussed, assessors independently arrived at an overall rating of a candidate's ability on the one to seven scale described above. Assessors were asked not to use the average of all the skill ratings for the overall rating.

Other Candidate Appraisal Methods

Supervisor Appraisals

Candidates were rated by their respective supervisors on all nine skills on a different 7-point anchored scale which was specially developed for the program by BEP (see Appendix D). These ratings were collected prior to the assessment center. Supervisors were provided with examples of possible behaviors and were trained by BEP staff members to rate candidates objectively.

Research Data

Additional measures on each candidate were also collected: self-ratings and biographical data. These data were used solely for research purposes and not for selection decisions.

Self-Appraisals. Upon completion of the assessment center, each candidate was asked to rate him- or herself on all nine skills using the same 7-point scale as in the supervisor appraisal. Only 22 candidates completed and returned a self-appraisal. (The form used was essentially the same as the Supervisor Appraisal form, Appendix D.)

Biographical Data. The following data for each candidate were collected or computed for analysis: sex (males were coded "1", females, "2"), age (in years), number of years employed at BEP and number of years in grade (each of which may be interpreted as a rough measure of experience), and current grade.

The Selection Process

After all candidates had completed the assessment center, supervisor ratings and the final assessment center reports were used by BEP officials to rank candidates on a central register.
According to the rating plan developed by BEP, assessment center ratings on Skills 1, 2, 3, 4, 5, and 6 were summed and added to supervisor ratings on Skills 7, 8, and 9 (the skills which could not be measured in the assessment center). Candidates were then ranked on the central register on the basis of this total score. Ties were broken by taking into account the mean of all nine skills on the supervisor appraisal. An Upward Mobility Review Panel interviewed the upper 25% of the candidates (23 candidates when ties were considered), and reviewed their folders, which contained assessment center reports, supervisor appraisals, vocational interest essays (dealing with candidates' interests in a particular target job), and qualifications and skills surveys. For each target position, the panel established a register on which the candidates were rank ordered. As the final phase of the selection process, the supervisor of the position to be filled interviewed the top 5 candidates on the register for that position, reviewed the folder material and made the final selection.

Interrater Reliability

Estimates of interrater reliability were determined by correlating (Pearson r) the skill and overall ratings between assessors in each team. Table 1 presents the various reliability estimates and sample sizes. Reliabilities were obtained for each team for each day as well as across days. Reliabilities ranged from .32 to 1.00 with a median of .95. Median reliabilities were determined across teams by day (the last column of the table) and across teams and days (the last column of the last row). Reliabilities were fairly high, which tends to verify the findings of other investigators. For example, Dicken and Black (1965) reported interrater reliabilities in two samples of from .68 to .99 with a median of .89 and from .85 to .98 with a median of .92. Bray and Grant (1966) found reliabilities of from .60 to .75, and Thomson (1970) reported reliabilities for psychologist-assessors and manager-assessors ranging from .73 to .93 and from .78 to .95, with median reliabilities of .85 and .89, respectively. Other investigators have reported similar findings (McConnell & Parker, 1972; Greenwood & McNamara, 1967). In those cases in the present study where low reliabilities were found, i.e., .32 and .56, it should be noted that small sample sizes were involved (in each cell of Table 1) and that assessors did not differ by more than one scale point in their ratings.

There are a number of possible explanations for the high interrater reliabilities found in this study:

- Some ratings were based on written assessee products, which tend to be more reliable than ratings of non-written behavior (Howard, 1974).
- Group Discussion Observation Forms were shared between assessors.
- Comparatively few skills were measured.
- Assessors were given special training.

Since the discussion of skill ratings occurred just prior to arriving at the overall rating, the correlations of the "independent" overall ratings made by each assessor probably are inflated. Furthermore, a survey of team assessors revealed that Team 3B assessors discussed their "independent" ratings before they recorded them, which was probably why some reliabilities were 1.00. As a result of this finding, median (or, where appropriate, mean) reliabilities across teams for each day and across teams and days (the last column of Table 1) were computed without the data from Team 3B.
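For the modern reader, each entry in Table 1 is simply a Pearson correlation between two assessors' independent ratings. The following is a minimal sketch of that computation, not the original analysis; the rating arrays are hypothetical, and scipy is assumed to be available.

```python
# Sketch only: Pearson interrater reliability for one team on one day.
import numpy as np
from scipy.stats import pearsonr

# Independent 1-7 ratings by the two assessors of a team for the
# candidates they observed that day, keyed by skill number (hypothetical).
ratings = {
    1: ([4, 5, 3, 6, 4, 2], [4, 5, 3, 5, 4, 2]),
    2: ([3, 6, 4, 5, 2, 3], [3, 6, 5, 5, 2, 3]),
    4: ([5, 4, 6, 3, 4, 5], [5, 5, 6, 3, 4, 4]),
}

team_day = {skill: pearsonr(a, b)[0] for skill, (a, b) in ratings.items()}
print({s: round(r, 2) for s, r in team_day.items()})

# The "across teams" column of Table 1 is the median of the per-team
# estimates (the mean when only two teams contributed), computed here,
# as in the report, without Team 3B.
print("median:", round(float(np.median(list(team_day.values()))), 2))
```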
In addition, the median reliability was redetermined for Days 1 through 5 and Teams 1 through 3A, excluding the reliabilities of the overall ratings. The median reliability was found to be .94. The range of reliabilities remained the same, .32 to 1.00.

The data show that for the five skills (Skill 3 was excluded because it was objectively measured), there was, in most cases, a high degree of relationship between the assessors' "independent" ratings. An important requirement of an effective applicant appraisal method is that it possess substantial interrater reliability; very high interrater reliability has been demonstrated for the assessment center in this study.

TABLE 1

Interrater Reliability Estimates

Day     Skill     Team 1   Team 2   Team 3A   Team 3B   Across Teams(a)
1       1           .85      .87      .95       --          .87
        2           .94     1.00      .80       --          .94
        4          1.00      .92      .90       --          .92
        5           .94      .96      .97       --          .96
        6           .86      .87     1.00       --          .87
        Overall    1.00     1.00     1.00       --         1.00
        n             6        5        4       --           15
2       1           .63     1.00       --     1.00          .87
        2           .94      .95       --     1.00          .95
        4           .87     1.00       --     1.00          .94
        5           .85      .64       --      .96          .75
        6          1.00     1.00       --     1.00         1.00
        Overall    1.00      .94       --     1.00          .97
        n             6        4       --        5           10
3       1           .79      .96      .95       --          .95
        2           .80      .87     1.00       --          .87
        4           .85     1.00      .94       --          .94
        5           .95     1.00      .92       --          .95
        6           .32     1.00      .93       --          .93
        Overall    1.00      .96      .84       --          .96
        n             6        6        6       --           18
4       1           .92      .94       --     1.00          .93
        2           .98      .97       --     1.00          .93
        4           .87      .91       --     1.00          .89
        5           .96      .96       --     1.00          .96
        6           .94      .93       --     1.00          .94
        Overall    1.00      .56       --     1.00          .78
        n             5        6       --        4           11
5       1           .86     1.00      .94       --          .94
        2           .94      .89      .97       --          .94
        4           .96     1.00      .94       --          .96
        5           .98      .99      .96       --          .98
        6           .94      .90      .95       --          .94
        Overall     .94      .95      .91       --          .94
        n             6        6        6       --           18
Across  1           .89      .95      .95     1.00          .95
Days    2           .95      .91      .97     1.00          .95
        4           .93      .98      .93     1.00          .93
        5           .92      .95      .93      .96          .93
        6           .95      .93      .95     1.00          .95
        Overall     .98      .87      .95     1.00          .95
        n            29       27       16        9           72(b)

Note. n = number of candidates.
(a) Median (or mean) reliabilities were determined without Team 3B included.
(b) One candidate was evaluated individually; therefore the total does not reach 73.

The Relative Importance of the Skill Ratings in Determining the Overall Assessment Center Rating

The extent to which each skill rating contributed to the overall assessment center rating was determined by inspection of the beta weights, computed by using the skill ratings as predictors in a multiple regression equation with the overall rating as the criterion or dependent variable. Darlington (1968) has stated that such weights may be used as estimates of the "importance" of causal relationships. McNemar (1969) has also stated that the magnitude of squared beta weights may be used as estimates of relative importance.

Stepwise multiple regression, which indicates only those skills which contributed significantly to the prediction of the overall rating, was conducted. The F ratio required for entering each variable was arbitrarily set at 1.5 and the F ratio to delete at 1.0. All six skills were entered; however, F tests for significant increases in R² showed that the last skill to be entered, Skill 4, did not contribute significantly to increased predictability. Table 2 gives the multiple R and R² at each step, the beta weights in the final step, the simple correlations of each skill with the dependent variable, and the partial correlations (with the dependent variable) in the final step. Rank ordering from largest to smallest beta weight, the skills were: 2, 5, 1, 6, 3, and 4. It should be noted that the use of the simple correlations between each skill and the overall rating would not necessarily result in the same rank order as that produced by the beta weights, since for bivariate correlations no consideration is given to the interrelationships among the independent variables themselves.
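The stepwise procedure itself can be sketched as follows. This is a hedged reconstruction rather than the program actually used: R² is computed by ordinary least squares, a variable enters when its F-to-enter exceeds 1.5, and an entered variable is deleted if its F-to-remove falls below 1.0. The data generated at the bottom are hypothetical placeholders.

```python
# Hedged sketch of forward stepwise regression with F-to-enter = 1.5 and
# F-to-delete = 1.0, as described in the text.  Not the original analysis.
import numpy as np

def r2(X, y):
    """R-squared of an ordinary least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X]) if X.shape[1] else np.ones((len(y), 1))
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    e = y - X1 @ b
    return 1.0 - (e @ e) / (((y - y.mean()) ** 2).sum())

def f_change(r2_full, r2_reduced, n, k_full):
    """F ratio for the change in R-squared from adding or removing one predictor."""
    return (r2_full - r2_reduced) / ((1.0 - r2_full) / (n - k_full - 1))

def stepwise(X, y, f_enter=1.5, f_remove=1.0):
    """Return the entry order (column indices) of the selected predictors."""
    n, p = X.shape
    chosen = []
    while True:
        remaining = [j for j in range(p) if j not in chosen]
        base = r2(X[:, chosen], y)
        trials = [(f_change(r2(X[:, chosen + [j]], y), base, n, len(chosen) + 1), j)
                  for j in remaining]
        if not trials or max(trials)[0] < f_enter:
            return chosen
        chosen.append(max(trials)[1])
        # Delete any previously entered variable whose F-to-remove is now < 1.0.
        for j in list(chosen[:-1]):
            rest = [k for k in chosen if k != j]
            if f_change(r2(X[:, chosen], y), r2(X[:, rest], y), n, len(chosen)) < f_remove:
                chosen.remove(j)

# Hypothetical data: six skill ratings as predictors, overall rating as criterion.
rng = np.random.default_rng(0)
X = rng.integers(1, 8, size=(82, 6)).astype(float)
y = X @ np.array([.40, .10, .10, .10, .25, .15]) + rng.normal(0, .5, size=82)
print("entry order:", stepwise(X, y))
```

With the report's data, this kind of procedure entered Skill 2 first (as a sole predictor it gives R² = .7987, the roughly 80% of variance noted below) and Skill 4 last.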
The use of beta weights allows one to estimate the unique or independent contribution of a particular skill to the overall rating. In terms of prediction, the larger a beta weight, the more that variable adds to the predictability of the dependent variable. For example, increasing Skill 2 by 1 unit increases the overall rating by .4159 units; increasing Skill 6 by 1 unit increases the overall rating by .1423 units. No determination could be made, by the use of this technique, as to the apparent weight the assessor teams attached to the various skills. Beta weights are based upon correlations among independent variables and correlations between independent variables and the dependent variable, thus obscuring any judgmental or clinical weighting. Nevertheless, the results of this analysis show that Skill 2--"the ability to look at all possible courses of action and make appropriate decisions"--supplied the most important information about a candidate's overall assessment center rating. In fact, this skill, when used as a sole predictor, accounted for 80% of the variance of the overall rating. It also had the highest partial correlation with the overall rating in the final step, r = .61. While the other five skills correlated highly and significantly (p < .001) with the overall rating, due to relatively high intercorrelations among skills only four of the five added significant variance. Some of the information these skills added was redundant--already supplied by Skill 2. In this particular context, it was apparently not necessary to measure Skill 4 in order to arrive at an overall rating. This is not to say that Skill 4 is not important or relevant for assessment, since the relationship of the overall rating to job success has not yet been empirically demonstrated. Furthermore, the measurement of Skill 4 would be necessary in order to provide candidates with individual feedback for development purposes.

TABLE 2

Stepwise Multiple Regression Analysis, Prediction of the Overall Assessment Center Rating

Step   Skill     Multiple      R²      Beta Weight     Simple r with        Partial r
       Entered      R                  in Final Step   Dependent Variable   in Final Step
1      2          .8937      .7987       .4159              .8937              .6105
2      5          .9334      .8711       .2458              .7538              .5357
3      6          .9491      .9007       .1423              .7607              .2910
4      3          .9546      .9113       .1093              .5869              .2967
5      1          .9573      .9163       .1490              .8574              .2520
6      4          .9591      .9198       .0899              .6975              .2048

The Ability to Differentiate Between Candidates

If the assessment center is to be an effective and valuable measurement device, it must be capable of differentiating between candidates. In the selection procedure, candidates were divided into two groups--those that were in the upper 25% and those in the lower 75% on the central register. Differences in ratings between these groups on each skill for each method (assessment center, supervisor, and self-ratings) were compared by means of one-way analysis of variance.

Sample sizes, means, and standard deviations of ratings for the total group, for the upper 25%, and for the lower 75% appear in Table 3. The last column of the table shows the corresponding F ratio for each group comparison.
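Each F ratio in Table 3 comes from a one-way analysis of variance with two groups. A minimal sketch follows, assuming scipy and hypothetical ratings; with only two groups, the F ratio is simply the square of the corresponding t statistic.

```python
# Sketch of one group comparison from Table 3; the ratings are hypothetical.
from scipy.stats import f_oneway

upper_25 = [5, 6, 4, 5, 6, 5, 4]           # skill ratings, upper 25% group
lower_75 = [3, 2, 4, 3, 2, 3, 4, 2, 3]     # skill ratings, lower 75% group

f_ratio, p_value = f_oneway(upper_25, lower_75)
print(f"F = {f_ratio:.2f}, p = {p_value:.4f}")
```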
TABLE 3

Comparisons of Candidates in the Upper 25% and in the Lower 75% on Skill Ratings

                            Total Group        Upper 25%          Lower 75%
Method       Skill         N    M     SD     N    M     SD     N    M     SD    F Ratio
Assessment   1            82  3.06  1.49    23  4.57  1.31    59  2.47  1.10   53.40***
Center       2            82  3.26  1.68    23  5.04  1.40    59  2.56  1.21   64.08***
             3            82  1.91  1.70    23  3.26  2.14    59  1.39  1.15   26.26***
             4            82  4.74  1.48    23  6.22   .80    59  4.17  1.28   51.31***
             5            82  4.37  1.33    23  5.43  1.20    59  3.95  1.14   27.44***
             6            81  3.83  1.72    23  5.57  1.08    58  3.14  1.42   54.54***
             Overall      82  3.72  1.46    23  5.30  1.06    59  3.10  1.08   69.62***
Supervisor   1            79  5.10  1.15    22  5.55  1.30    57  4.93  1.05    4.77*
Rating       2            73  5.21  1.12    22  5.77  1.07    51  4.96  1.06    9.02**
             3            61  5.05  1.07    18  5.56  1.10    43  4.84  1.00    6.20*
             4            82  5.38  1.20    23  5.65  1.11    59  5.27  1.23    1.67
             5            81  5.25  1.12    22  5.73  1.16    59  5.07  1.06    5.85*
             6            82  5.17  1.16    23  5.65  1.07    59  4.98  1.15    5.80*
             7            82  5.21  1.19    23  5.61  1.34    59  5.05  1.11    3.73
             8            82  5.32  1.12    23  5.74  1.18    59  5.15  1.06    4.74*
             9            82  5.51  1.60    23  5.91  1.78    59  5.36  1.52    2.02
Self         1            22  5.64  1.26    14  6.00  1.04     8  5.00  1.41    3.64
Rating       2            22  6.41   .96    14  6.14  1.10     8  6.88   .35    3.29
             3            22  5.45  1.06    14  5.64  1.01     8  5.13  1.13    1.24
             4            22  6.45   .86    14  6.57   .76     8  6.25  1.04     .70
             5            22  5.50  1.30    14  5.43  1.28     8  5.63  1.41     .11
             6            21  6.43   .87    13  6.31   .95     8  6.63   .74     .65
             7            22  6.05  1.05    14  6.00  1.11     8  6.13   .99     .07
             8            22  6.14   .99    14  6.14  1.03     8  6.13   .99     .00
             9            22  5.86  1.39    14  6.07  1.38     8  5.50  1.41     .85

* p < .05   ** p < .01   *** p < .001

The probability of Type I error was inflated because of the multiple comparisons involved, the unequal sample sizes and the possible lack of homogeneity of variance. According to Myers (1972), however, this error rate can be deflated if there is a positive relationship between sample size and variance. For example, for Skill 4, as measured in the assessment center, the lower 75% group had the larger sample size (n = 59 as compared to n = 23) and the larger variance (s² = 1.64 as compared to s² = .64); therefore, the probability of Type I error was reduced. For approximately one third of the comparisons in Table 3, this positive relationship existed. However, since the probability of Type I error still remained high, the acceptable alpha level was set to .01.

The data in Table 3 show that:

- For comparisons on each of the skills measured in the assessment center, the levels of significance were all beyond .001. The skill ratings were significantly higher for the upper 25% than for the lower 75% group.

- There was only one F ratio which was significant (Skill 2, p < .01) for the supervisor skill comparisons. The ratings on Skill 2 were significantly higher for the upper 25% group than for the lower 75% group.

- None of the F ratios was significant for the self-rating skill comparisons--there were no significant differences between the two groups (upper and lower) on the nine skill ratings. However, as there were only 14 candidates in the upper 25% group who had completed self-ratings and 8 in the lower group, the samples may not be representative.

Both supervisor and self-ratings failed to differentiate adequately between those candidates who placed in the upper 25% and those who placed in the lower 75% on the central register. This is not too surprising given the weight placed on assessment center ratings. What is interesting, though, is the fact that there were no significant differences found for Skills 7, 8, and 9 as measured by supervisor ratings. These three skills were to have received a weight of one third in the ranking of candidates. As it was, they probably received considerably less weight, and assessment center ratings received considerably more weight. Guilford and Fruchter (1973) pointed out that ratings are automatically weighted in proportion to the size of their variation. Because of their larger variances, assessment center ratings probably contributed proportionately more to the sum of the nine ratings, which was used to rank candidates on the central register.

To examine the degree of variability in the ratings, sign tests (Siegel, 1956) were employed to compare the standard deviations of assessment center ratings on Skills 1 through 6 with the standard deviations of both supervisor and self-ratings on the same skills. Table 4 shows, for all comparisons, that the standard deviations of assessment center ratings were larger than those of the other methods (p = .016). Restricted ranges were more evident in supervisor and self-ratings. That is, assessors tended to make greater use of the full range of possible ratings than did supervisors or candidates. These results and those reported earlier in this section demonstrate that the assessment center was better able to differentiate between candidates.

TABLE 4

Sign Tests for Means and Standard Deviations

                       Comparison     Direction
Mean                   AC vs SUPR     SUPR > AC*
                       AC vs SELF     SELF > AC*
Standard Deviation     AC vs SUPR     AC > SUPR*
                       AC vs SELF     AC > SELF*

Note. AC = assessment center ratings. SUPR = supervisor ratings. SELF = self-ratings.
* p = .016 (one-tailed test)

Objectivity in the Assessment Center Process

The fourth question this study was to answer was whether the assessment center process was objective in evaluating candidates.

Leniency Error

In order to investigate whether leniency error--the tendency to use the high end of the scale exclusively--existed, sign tests were used to compare the means and standard deviations of assessment center ratings with the means and standard deviations of both supervisor and self-ratings on the same skills.
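The sign test here reduces to simple binomial arithmetic: if all six skill comparisons fall in the same direction, the one-tailed probability under a 50/50 null hypothesis is (1/2)^6 = .0156, the p = .016 reported in Table 4. A minimal sketch, assuming scipy and using the total-group standard deviations from Table 3:

```python
# Sign test sketch: do assessment center SDs exceed supervisor SDs on Skills 1-6?
from scipy.stats import binomtest

ac_sd   = [1.49, 1.68, 1.70, 1.48, 1.33, 1.72]   # total group, Table 3
supr_sd = [1.15, 1.12, 1.07, 1.20, 1.12, 1.16]   # total group, Table 3

wins = sum(a > s for a, s in zip(ac_sd, supr_sd))
p = binomtest(wins, n=len(ac_sd), p=0.5, alternative="greater").pvalue
print(f"{wins} of {len(ac_sd)} in the predicted direction, p = {p:.3f}")  # p = 0.016
```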
As can be seen from Table 4, means on all assessment center skill ratings were smaller (p = .016) than those on supervisor or self-ratings. Standard deviations for all assessment center ratings were larger (p = .016) than those for supervisor and self-ratings. In addition, for five out of six skills, assessment center ratings were closer to the mid-point of the rating scale than were either supervisor ratings or self-ratings. The size of the standard deviations of assessment center ratings indicates that these were not central tendency errors. These findings indicate that supervisors tended to rate candidates consistently higher than did assessors on Skills 1 through 6. Likewise, candidates' ratings of themselves were higher than assessor ratings. Leniency error, a major problem in discriminating among candidates for selection purposes, was not found in the assessment center ratings in this study.

Other Possible Sources of Error

Other possible sources of error were examined by (a) determining whether there were significant differences between candidates in the upper 25% and in the lower 75% on biographical variables such as sex, age, time at BEP, time in grade, and grade level, and by (b) determining whether there were significant correlations between skill ratings and sex, age, time at BEP, time in grade, and grade level.

Comparisons between upper and lower groups on biographical data. Table 5 includes sample sizes, means, and standard deviations for the total group, the upper 25% and the lower 75% on the following variables: sex (males were coded "1", females, "2"), age (in years), time at BEP (in years), time in grade (in years) and grade level.
(Since there were essentially three categories of grade--General Schedule (GS), Wage Grade (WG), and Printing and Lithographic (WP)--grade level was analyzed separately for each.) There were no statistically significant differences between the upper and lower groups on any of the aforementioned variables, except for those in the GS category. Those GS candidates in the upper group had significantly higher (p < .05) grades than those in the lower group. For reasons cited earlier--the unequal sample sizes and the possible heterogeneity of variance--the increased probability of a Type I error may cloud the results of this finding.

TABLE 5

Comparisons of Candidates in the Upper 25% and in the Lower 75% on Selected Biographical Variables

                        Total Group           Upper 25%             Lower 75%
Variable             N     M      SD       N     M      SD       N     M      SD     F Ratio
Sex(a)              82   1.70    .46      23   1.70    .47      59   1.69    .46      .00
Age(b)              81  35.67  11.22      22  35.55  11.79      59  35.71  11.10      .00
Time at BEP(b)      81   7.69   6.92      22   7.82   7.85      59   7.64   6.61      .01
Time in Grade(b)    81   3.25   2.17      22   3.68   2.53      59   3.08   2.01     1.22
Grade: GS           53   3.87   1.21      14   4.43   1.40      39   3.67   1.08     4.35*
Grade: WG           26   3.00   1.41       8   3.13   1.55      18   2.94   1.39      .09
Grade: WP            2   6.50   3.54       0   0.00   0.00       2   6.50   3.54      --

Note. GS = General Schedule. WG = Wage Grade. WP = Printing and Lithographic.
(a) Males were coded "1", females, "2".
(b) In years.
* p < .05

Relationships between skill ratings and biographical data. Correlations for each method between skill ratings and each of the following variables are presented in Table 6: sex, age, time at BEP, time in grade, and grade level for WG and GS candidates. As evidenced in Table 6:

- There are a number of significant and positive correlations between skill ratings and GS grade level for assessment center ratings. Two correlations were significant beyond the .001 level (Skill 2 and the overall rating), two were significant beyond the .01 level (Skills 1 and 3), and one beyond the .05 level (Skill 5). There were no significant correlations between assessment center ratings and sex, age, time at BEP, time in grade, or WG grade level.

- Correlations between supervisor ratings on Skills 1, 4, 5, and 6 and time at BEP were negative and significant (p < .05). This indicates a tendency for candidates who have worked longer at BEP to be rated lower by their supervisors. The correlation between supervisor ratings on Skill 4 and age was also negative and significant at the .05 level, indicating a tendency for older candidates to be rated lower. Skill 3 was positively correlated with GS grade level for supervisor ratings (p < .05).

- For self-ratings, there was a significant and positive correlation between Skill 7 and sex. Female candidates rated themselves significantly higher on this skill than did males. Skill 7 was also positively correlated with time at BEP (p < .05), demonstrating a tendency for candidates who had worked at BEP longer to rate themselves significantly higher. Skill 9 was negatively correlated with time in grade (p < .05). This indicated that candidates who had more time in grade tended to rate themselves significantly lower on this particular skill. As far as grade level was concerned, higher graded GS candidates rated themselves significantly higher on Skill 3 (p < .05), and higher graded WG candidates rated themselves significantly higher on Skill 7 (p < .001).

TABLE 6

Correlations Between Selected Biographical Variables and Skill Ratings

                                       Time      Time         Grade
Method       Skill     Sex     Age    at BEP   in Grade     GS       WG
Assessment   1        -.04    -.05     .06      .09        .44**     .09
Center       2        -.09    -.15     .02      .01        .47***    .01
             3        -.05    -.03     .09      .10        .41**    -.06
             4        -.15    -.05    -.13      .01        .15       .09
             5        -.02     .05     .07      .12        .33*      .27
             6        -.08    -.10    -.09     -.02        .22       .03
             Overall  -.13    -.09     .03      .05        .46***    .07
Supervisor   1         .08    -.21    -.26*    -.11        .11      -.12
Rating       2         .17    -.09    -.17     -.09        .10       .00
             3         .06    -.18    -.20     -.16        .34*      .23
             4        -.01    -.23*   -.25*    -.18        .03      -.29
             5         .02    -.22    -.22*    -.19        .08      -.29
             6         .17    -.17    -.26*    -.17        .08       .03
             7         .20    -.04    -.09     -.15        .14       .03
             8         .00    -.21    -.22     -.06        .20      -.13
             9         .18     .15     .01     -.19       -.04      -.13
Self         1         .07    -.29    -.12      .13        .39      -.32
Rating       2         .38    -.03     .21     -.08        .19       .00
             3         .07    -.21    -.05      .17        .59*     -.32
             4         .09     .21     .15      .07        .12       .26
             5         .00    -.38     .15     -.33        .02      -.53
             6         .15    -.04     .29      .21        .24       .10
             7         .53*    .14     .43*     .29        .37      1.00***
             8         .40     .03    -.09      .13        .36       .25
             9         .05     .20    -.33     -.53*       .00      -.63

* p < .05   ** p < .01   *** p < .001

Tests of differences (t tests) between correlations were conducted to determine if the correlations of the assessment center ratings with time at BEP and age were significantly different from those for supervisor ratings. This analysis revealed that only two correlations, those for Skills 1 and 5 on time at BEP, were significantly different (p < .05).

Discussion. Several interesting relationships existed between GS grade level and assessment center skill ratings, as well as between group membership--upper 25% or lower 75%--and GS grade level. Since assessors were not aware of candidates' grade levels, and since neither age itself nor experience (time at BEP) had a relationship with skill ratings, perhaps some other factor, such as competency, was indicated by grade level. If grade level reflects competency, then the higher a candidate's GS grade level, the higher his/her level of ability. This same relationship did not emerge between WG grade level and assessment center ratings, however. Apparently, the same factors were not reflected by WG grade level as by GS grade level. Given a larger sample size and a normal distribution of grade level, perhaps significant correlations would have been found for WG candidates.

While higher level GS candidates received higher assessment center ratings on some skills, the assessment center did not favor GS candidates over WG candidates as a group. The correlation between grade category (GS or WG) and group membership (upper 25% or lower 75%) was practically zero (r = .02). It was not clear whether the significant relationships between assessment center skills and GS grade level reflected competency or another factor.

There were no significant correlations between assessment center ratings on any of the six skills and either age or time at BEP. However, for these six skills, supervisor ratings showed that one skill rating (Skill 4) was negatively related to age at a significant level, and four of the six skill ratings (Skills 1, 4, 5, and 6) were negatively and significantly related to time at BEP, demonstrating potential sources of rater error. While t tests for correlated data indicated that only two of these five significant correlations for supervisor ratings were significantly greater than the corresponding assessment center coefficients, it is important to note that these results (applying an a priori alpha level of .05) indicated potential rater error for supervisor ratings, but not for assessment center ratings.
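The report does not state which formula was used for these t tests between dependent correlations. One classical choice, shown here purely as an illustration with hypothetical r values, is Hotelling's t for two correlations that share a variable (here, age correlated with both the supervisor rating and the assessment center rating):

```python
# Illustrative only: Hotelling's t for comparing two dependent correlations.
# The report's exact formula is unknown; the r values below are hypothetical.
import math

def hotelling_t(r12, r13, r23, n):
    """t statistic (df = n - 3) for H0: rho12 = rho13, with variable 1 shared."""
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    return (r12 - r13) * math.sqrt((n - 3) * (1 + r23) / (2 * det))

# r12: age with supervisor rating; r13: age with AC rating; r23: the two ratings.
print(round(hotelling_t(r12=-0.23, r13=-0.05, r23=0.15, n=81), 2))
```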
Relationships Between Different Methods for the Same Skills

To establish whether the assessment center supplied information about candidates which supervisor or self-ratings did not, correlations between the same skills as measured by different methods were examined. To the extent that assessment center ratings correlate significantly with either supervisor or self-ratings on the same skills, the assessment center is not contributing unique information about possible future job performance (assuming all methods are equally reliable and valid). Table 7 shows correlations between the same skills for each method combination, i.e., assessment center with supervisor ratings, assessment center with self-ratings, and supervisor with self-ratings.

TABLE 7

Relationships Between Different Methods for the Same Skills

Skill    AC vs SUPR    AC vs SELF    SUPR vs SELF
1           .08           .59**          .51*
2           .21           .00            .22
3           .32*          .53*           .48*
4           .12           .03            .17
5           .12           .13            .40
6           .15           .17            .18

Note. AC = Assessment center rating. SUPR = Supervisor rating. SELF = Self-rating.
* p < .05   ** p < .01

The correlations for Skill 1 were significant between assessment center and supervisor ratings (p < .01) and between supervisor and self-ratings (p < .05). All the correlations between different methods on Skill 3 were significant (p < .05). Except for these findings, skills do not appear to be independent of method. With the exceptions just noted, these data support the hypothesis that the assessment center can provide information that is not contributed by either supervisor or self-ratings of current job performance. The assessment center can therefore supply unique information, on the same skills, about a candidate's probable performance on a target job.

Summary and Conclusion

The Civil Service Commission, as requested by BEP, developed an assessment center and trained assessors in support of BEP's Upward Mobility Program. Seven target positions were identified, and a job analysis revealed that there were nine skills which were critical to success in all seven jobs. Candidates participated in a four-hour assessment center designed to evaluate them on six of the nine skills. Trained assessors observed 82 candidates in three job-related exercises. The agency also obtained supervisor ratings and used this information in conjunction with assessment center ratings to rank all the candidates on a central register. Candidates in the upper 25% were then assigned to one or more of seven different registers (one for each target job), depending upon how well they matched position requirements.

Interrater reliabilities were determined for assessors by teams, days, across days, across teams, and across teams and days. Considerable agreement was found between pairs of raters. The results verify findings from other studies. Certain factors were recognized which might have led to the high reliabilities found here.

An examination of the beta weights for assessment center skills revealed that certain skills contributed more to the overall assessment center rating than did others, particularly Skill 2. While four other assessment center skills contributed significantly to the prediction of the overall rating, some of this information was redundant.

Greater differentiation between candidates was afforded by the assessment center than by the other methods.
While ratings by supervisors on three skills, which were used to rank candidates for selection purposes, did not significantly differentiate between those candidates in the upper 25% on the selection register and those in the lower 75%, assessment center ratings were able to differentiate significantly between candidates (p < .001). Supervisor and self-ratings were more restricted in range and subject to leniency errors than were assessment center ratings.

There were tendencies for supervisor ratings of some skills to be affected by a candidate's age and length of time at BEP. Assessment center ratings appeared to be free of these errors, but were not significantly different from supervisor ratings except in two instances. Assessment center ratings were highly correlated with GS grade level.

Considering the available evidence, the assessment center reported in this study seems to be a reliable and objective method which proved to be an extremely valuable technique for differentiating between candidates. The assessment center can be a very useful tool for evaluating candidates in an upward mobility situation where performance on a target job cannot be satisfactorily measured by other methods.

This research raises several important questions that suggest the following additional studies:

- The exact relationship between grade and assessment center performance should be further examined.

- A long-term study should be conducted in order to determine the predictive effectiveness of the assessment center approach in this context.

- A determination should be made of how much weight an assessor places on a particular skill when arriving at an overall rating and whether, in fact, unit weights would serve just as well.

REFERENCES

Bray, D. W., & Grant, D. L. The assessment center in the measurement of potential for business management. Psychological Monographs, 1966, 80.

Darlington, R. B. Multiple regression in psychological research and practice. Psychological Bulletin, 1968, 69, 161-182.

APPENDIX D

SUPERVISORY APPRAISAL-UPWARD MOBILITY PROGRAM (FORM 2224)

... can independently identify and assimilate other data that may aid in an easier and faster approach to work accomplishment.

JUDGMENT-Ability to look at all possible courses of action and make appropriate decisions.
- Unable to make decisions appropriate to the situation.
- Decisions at times are less than appropriate to the situation.
- Makes decisions consistently appropriate to the situation but does not always consider alternatives.
- Looks at various courses of action and makes the best possible judgment appropriate to the situation.

INITIATIVE-Self-starting; follows through on work assignments.
- Requires constant prodding to complete work assignments.
- Requires some prodding to complete assignments.
- Completes work assignments without prodding and at times initiates action to overcome obstacles and resolve problems.
- Completes work assignments without prodding and always initiates action to overcome obstacles and resolve problems.

ARITHMETIC ABILITY-Ability to accurately solve job-related mathematical problems.
- Has great difficulty in handling the simplest problems, e.g., multiplication, division, etc.
- Experiences slight difficulty in solving simple arithmetic problems.
- Has the ability to solve all simple arithmetic problems as well as some more difficult problems, e.g., percentages.
- Outstanding arithmetic ability which leads to the completion of difficult and complex problems.

FORM 2224 (Page 2)
SUPERVISORY APPRAISAL-UPWARD MOBILITY PROGRAM

- ... along with others nor work effectively with them.
- Experiences some difficulty in getting along with others and working effectively with them.
- Gets along well with others and usually responds in a thoughtful manner to others.
- Always aware of how his/her actions affect others. Gets along excellently with others, is very tactful and always considerate of the other person's point of view.

ORAL COMMUNICATIONS-Ability to express ideas clearly, logically and in the correct grammatical form.
- Unable to communicate in a clear, understandable manner.
- Can communicate in a fairly clear and understandable manner but experiences some grammatical difficulty.
- Communicates in a clear and understandable manner and experiences little grammatical difficulty.
- Excellent ability in expressing ideas clearly, logically and in the proper grammatical form.

CONSISTENCY-Quality of work performance is consistent and predictable in varying situations.
- Quality of work is not consistent and predictable.
- Quality of work is not always consistent and predictable.
- Quality of consistency and predictability on the job is good even in varying situations.
- Quality of consistency and predictability on the job is outstanding despite the complexity of the situation.

FLEXIBILITY-Ability to adjust to changes in varying work situations.
- Unable to adjust to changes in varying work demands.
- Can, but with some difficulty, adjust to changes in varying work demands.
- Can usually adjust rapidly to changes in varying work demands.
- Exceptional ability to rapidly adjust to continual changes in work demands.

DEPENDABILITY-Punctual; regularly stays at job site except during periods of excused absence or leave.
- Seldom punctual or at job site when needed.
- Occasionally not at job site when needed and at times not punctual.
- Rarely late and can be counted on to be at his/her job site when needed.
- Always punctual and never unnecessarily leaves job site.

EVALUATOR (Print or type full name)          SIGNATURE (Evaluator)          DATE