An Evaluation of the Upward Mobility Assessment Center for the Bureau of Engraving and Printing

United States Civil Service Commission
Bureau of Policies and Standards
Technical Memorandum 76-6

AN EVALUATION OF THE UPWARD MOBILITY ASSESSMENT CENTER FOR THE BUREAU OF ENGRAVING AND PRINTING

Hardy L. Hall
Applied Psychology Section
Personnel Research and Development Center
U. S. Civil Service Commission
Washington, D. C.
July 1976

ABSTRACT

As one method for selecting lower-graded employees for upward mobility positions, the Bureau of Engraving and Printing used an assessment center developed by the Personnel Research and Development Center of the U. S. Civil Service Commission. The assessment center was found to have very high interrater reliabilities between assessors (median = .94) and was better able to differentiate between candidates than were supervisor or self-ratings. Assessment center ratings were less subject to leniency errors and had no significant relationship to candidate sex, age, or experience. Supervisor ratings of some skills were found to be related to age and experience. The assessment center supplied information about a candidate's ability that was not provided by either the supervisor or self-ratings.

ACKNOWLEDGMENT

The author wishes to thank the following individuals of the PRDC staff for their helpful comments and suggestions: Dr. Richard D. Neidig, Dr. Leon I. Wetrogan, Dr. Charles G. Martin, Dr. Kenneth R. Brown and Cynthia Clark. The assistance of Ruth Christensen in the typing of the report, Lillian Frazier and Maria Williams in the coding of the data, and Ken Kalscheur and L. James Spaulding of BEP in the collection of data is greatly appreciated.

CONTENTS

List of Skills
The Assessment Center
Other Candidate Appraisal Methods
The Selection Process
Interrater Reliability
The Relative Importance of the Skill Ratings in Determining the Overall Assessment Center Rating
The Ability to Differentiate Between Candidates
Objectivity in the Assessment Center Process
Relationships Between Different Methods for the Same Skills
Summary and Conclusion
References

Appendices:
A. Individual Exercise Observation Form
B. Group Discussion Observation Form
C. Final Assessment Report
D. Supervisory Appraisal

Tables:
1. Interrater Reliability Estimates
2. Stepwise Multiple Regression Analysis, Prediction of the Overall Assessment Center Rating
3. Comparisons of Candidates in the Upper 25% and the Lower 75% on Skill Ratings
4. Sign Tests for Means and Standard Deviations
5. Comparisons of Candidates in the Upper 25% and the Lower 75% on Selected Biographical Variables
6. Correlations Between Selected Biographical Variables and Skill Ratings
7. Relationships Between Different Methods for the Same Skills

AN EVALUATION OF THE UPWARD MOBILITY ASSESSMENT CENTER FOR THE BUREAU OF ENGRAVING AND PRINTING

The Personnel Research and Development Center of the U. S. Civil Service Commission was contacted by representatives of the Bureau of Engraving and Printing (BEP), Department of the Treasury, who requested assistance in the development and operation of an assessment center for their agency's Upward Mobility Program.
BEP wanted to use a technique which would provide a job-related, objective approach for evaluating candidates, beyond available supervisor ratings, unassembled examinations, or other commonly used measures. They were particularly concerned with the lack of differentiation afforded by supervisor appraisals in the past. Furthermore, since candidates were to be selected for jobs for which they had not had a chance to demonstrate the necessary abilities, they looked to the assessment center as an excellent opportunity to observe a candidate's performance in situations which were related to actual target positions. The seven target positions that were identified for the program were: production controller, accounting technician, voucher examiner, supply clerk, physical science technician, management assistant, and engineering draftsman.

An earlier publication by the Personnel Research and Development Center (Hall & Baker, 1975) dealt more specifically with the operational aspects of the assessment center. This report deals primarily with the evaluation of specific measurement components of the assessment center. Its purpose is to answer the following questions identified by the investigator as important in evaluating the effectiveness of the assessment center:

1. Did the assessors in each team demonstrate acceptable levels of interrater reliability?
2. Did each skill rating contribute to the overall assessment center rating?
3. Did the assessment center differentiate between candidates better than supervisor or self-ratings?
4. Was the assessment center objective--not significantly affected by variables such as age and sex?
5. Did the assessment center supply information about candidates which supervisor or self-ratings did not?

List of Skills

A job analysis of the target positions revealed that the following skills were essential for success across all seven jobs:

1. Ability to identify and assimilate relevant data/factors in job-related situations.
2. Ability to look at all possible courses of action and make appropriate decisions.
3. Ability to solve job-related mathematical problems accurately.
4. Ability to get along with people and work effectively with them.
5. Ability to express ideas clearly, logically and in the correct grammatical form.
6. Ability to adjust to changes in varying work situations.
7. Ability to be a self-starter who follows through on work assignments.
8. Ability to consistently produce quality work and to be predictable in varying work situations.
9. Is punctual and regularly stays at the job site except during periods of excused absence or leave.

While these skills were found to be common to all the target positions, some were more important than others depending upon the particular job; e.g., the ability to solve job-related mathematical problems accurately was more important for the position of accounting technician than for management assistant.

The Assessment Center

Exercises

This model consisted of three exercises which required approximately four hours of the candidate's time. The assessment center provided an opportunity for a team of two assessors to observe the candidates in simulated situations, with emphasis on observing behaviors related to the essential skills. This model was designed to measure Skills 1, 2, 3, 4, 5, and 6. Skills 7, 8, and 9 could not be measured in the assessment center. The following exercises were used:

Analysis Problem. The candidate was given an agency problem which he/she was to work on individually.
This problem took approximately one hour and required the candidate to analyze and organize available information in order to make recommendations for handling the problem. Skills 1, 2, and 3 were measured in this exercise.

Group Discussion Exercise. A group of 5-6 candidates convened and engaged in a group discussion, each participant having an assigned role. The topic of the group discussion was related to the work of a Federal agency and required a solution. This exercise required approximately one hour and measured Skills 1, 4, 5, and 6.

Individual Presentation Exercise. The candidate was given two office situations involving interpersonal problems and was asked to respond to each of the situations. This was an individual exercise which required approximately 20 minutes and measured Skills 2, 4, and 5.

Drafting Exercise. This was an optional exercise included for candidates who were interested in the engineering drafting series. The candidate had seven drawings to complete, and the exercise was objectively scored.

Assessor Training

Eleven employees, currently enrolled in a management development program, were selected from the Bureau of Engraving and Printing workforce to serve as assessors and administrators for the assessment center operation. These assessors and administrators were representative of the work areas from which the target jobs had been selected. The assessor training consisted of two days of on-site training at the agency. The training program was conducted under the direction of Dale Baker and Hardy Hall of the U. S. Civil Service Commission. The training program required the assessors to participate in each of the exercises, included a discussion of the range of behaviors which were observable in each exercise, and required the assessors to observe and rate a group of mock candidates. Reliability checks were made to determine if the assessors were rating candidates on the same basis.

The Assessment Center Rating Procedure

A total of 82 candidates, ranging in grade from GS (General Schedule) 1 to 7, WG (Wage Grade) 2 to 5, and WP (Printing and Lithographic) 4 to 9, were evaluated by eight assessors over a period of five days. None of the assessors had any prior knowledge of the candidates they assessed. Normally, 15-18 candidates per day were evaluated by three teams of two assessors each. Teams 1 and 2 assessed every day, while the membership of Team 3 changed every other day. In other words, two assessors from Team 3 (Team 3A) evaluated candidates on the first, third, and fifth days, and two different assessors (Team 3B) worked the second and fourth days. This was done to provide a degree of flexibility in scheduling assessors. Each team of assessors assessed from four to six candidates a day.

In the Group Discussion Exercise, each assessor observed half of the candidates. In the event that there were five candidates in the discussion, one assessor observed three candidates while the other observed two. For the Analysis Problem, each assessor independently evaluated each candidate's written recommendation and his/her answers to a number of math items relating to the problem. These math items were objectively scorable, and ratings ranged from zero to seven. In the Individual Presentation Exercise, both assessors observed one candidate at a time and recorded their observations independently. The Drafting Exercise, developed and administered by BEP staff members, was scored by the supervisor of the engineering drafting section. The range of scores was from zero to seven.
Only candidates who had previously expressed an interest in the drafting position were administered this exercise. (No attempt was made to evaluate this exercise, as too few subjects participated.)

For reliability purposes, skills were measured at least twice where possible. For example, Skill 1 was measured in both the Analysis Problem and the Group Discussion Exercise. Assessors were trained to note only observable behaviors and not to make inferences. Assessors recorded their initial observations of each candidate's performance on the Individual Exercise Observation Form (see Appendix A) for the Analysis Problem and the Individual Presentation Exercise, and on the Group Discussion Observation Form (see Appendix B) for the Group Discussion Exercise. In those instances where one assessor did not directly observe a candidate, i.e., in the Group Discussion Exercise, he/she used the other assessor's observation form. Each behavior was assigned a plus (+) or a minus (-) depending upon whether it was effective or ineffective.

After this phase was completed, a rating from one to seven was made independently on each skill by each assessor. This step was taken in order to obtain a measure of interrater reliability. A rating of "one" was "extremely weak", a "four" was "satisfactory", and a "seven" was "outstanding". After all six skills were rated independently by each assessor, the assessors in each team discussed their ratings and the behaviors involved in each, and then reached a consensus rating on each skill. If a consensus could not be reached, split ratings, e.g., 4/5, were recorded; otherwise, ratings were given in integers only. After all skill ratings were discussed, assessors independently arrived at an overall rating of a candidate's ability on the one to seven scale described above. Assessors were asked not to use the average of all the skill ratings for the overall rating.

Other Candidate Appraisal Methods

Supervisor Appraisals

Candidates were rated by their respective supervisors on all nine skills on a different 7-point anchored scale which was specially developed for the program by BEP (see Appendix D). These ratings were collected prior to the assessment center. Supervisors were provided with examples of possible behaviors and were trained by BEP staff members to rate candidates objectively.

Research Data

Additional measures on each candidate were also collected: self-ratings and biographical data. These data were used solely for research purposes and not for selection decisions.

Self-Appraisals. Upon completion of the assessment center, each candidate was asked to rate him- or herself on all nine skills using the same 7-point scale as in the supervisor appraisal. Only 22 candidates completed and returned a self-appraisal. (The form used was essentially the same as the Supervisor Appraisal form, Appendix D.)

Biographical Data. The following data for each candidate were collected or computed for analysis: sex (males were coded "1", females, "2"), age (in years), number of years employed at BEP and number of years in grade (each of which may be interpreted as a rough measure of experience), and current grade.

The Selection Process

After all candidates had completed the assessment center, supervisor ratings and the final assessment center reports were used by BEP officials to rank candidates on a central register.
According to the rating plan developed by BEP, assessment center ratings on Skills 1, 2, 3, 4, 5, and 6 were summed and added to supervisor ratings on Skills 7, 8, and 9 (the skills which could not be measured in the assessment center). Candidates were then ranked on the central register on the basis of this total score. Ties were broken by taking into account the mean of all nine skills on the supervisor appraisal. An Upward Mobility Review Panel interviewed the upper 25% of the candidates (23 candidates when ties were considered), and reviewed their folders, which contained assessment center reports, supervisor appraisals, vocational interest essays (dealing with candidates' interests in a particular target job), and qualifications and skills surveys. For each target position, the panel established a register on which the candidates were rank ordered. As the final phase of the selection process, the supervisor of the position to be filled interviewed the top 5 candidates on the register for that position, reviewed the folder material and made the final selection.

Interrater Reliability

Estimates of interrater reliability were determined by correlating (Pearson r) the skill and overall ratings between assessors in each team. Table 1 presents the various reliability estimates and sample sizes. Reliabilities were obtained for each team for each day as well as across days. Reliabilities ranged from .32 to 1.00 with a median of .95. Median reliabilities were determined across teams by day (the last column of the table) and across teams and days (the last column of the last row). Reliabilities were fairly high, which tends to verify the findings of other investigators. For example, Dicken and Black (1965) reported interrater reliabilities in two samples of from .68 to .99 with a median of .89 and from .85 to .98 with a median of .92. Bray and Grant (1966) found reliabilities of from .60 to .75, and Thomson (1970) reported reliabilities for psychologist-assessors and manager-assessors ranging from .73 to .93 and from .78 to .95, with median reliabilities of .85 and .89, respectively. Other investigators have reported similar findings (McConnell & Parker, 1972; Greenwood & McNamara, 1967). In those cases in the present study where low reliabilities were found, i.e., .32 and .56, it should be noted that small sample sizes were involved (in each cell of Table 1) and that assessors did not differ by more than one scale point in their ratings.

There are a number of possible explanations for the high interrater reliabilities found in this study:

- Some ratings were based on written assessee products, which tend to be more reliable than ratings of non-written behavior (Howard, 1974).
- Group Discussion Observation Forms were shared between assessors.
- Comparatively few skills were measured.
- Assessors were given special training.

Since the discussion of skill ratings occurred just prior to arriving at the overall rating, the correlations of the "independent" overall ratings made by each assessor probably are inflated. Furthermore, a survey of team assessors revealed that Team 3B assessors discussed their "independent" ratings before they recorded them, which was probably why some reliabilities were 1.00. As a result of this finding, median (or, where appropriate, mean) reliabilities across teams for each day and across teams and days (the last column of Table 1) were computed without the data from Team 3B.
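For the modern reader, each entry in Table 1 is simply a Pearson correlation between two assessors' independent ratings. The following is a minimal sketch of that computation, not the original analysis; the rating arrays are hypothetical, and scipy is assumed to be available.

```python
# Sketch only: Pearson interrater reliability for one team on one day.
import numpy as np
from scipy.stats import pearsonr

# Independent 1-7 ratings by the two assessors of a team for the
# candidates they observed that day, keyed by skill number (hypothetical).
ratings = {
    1: ([4, 5, 3, 6, 4, 2], [4, 5, 3, 5, 4, 2]),
    2: ([3, 6, 4, 5, 2, 3], [3, 6, 5, 5, 2, 3]),
    4: ([5, 4, 6, 3, 4, 5], [5, 5, 6, 3, 4, 4]),
}

team_day = {skill: pearsonr(a, b)[0] for skill, (a, b) in ratings.items()}
print({s: round(r, 2) for s, r in team_day.items()})

# The "across teams" column of Table 1 is the median of the per-team
# estimates (the mean when only two teams contributed), computed here,
# as in the report, without Team 3B.
print("median:", round(float(np.median(list(team_day.values()))), 2))
```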
In addition, the median reliability was redetermined for Days 1 through 5 and Teams 1 through 3A, excluding the reliabilities of the overall ratings. The median reliability was found to be .94. The range of reliabilities remained the same, .32 to 1.00.

The data show that for the five skills (Skill 3 was excluded because it was objectively measured), there was, in most cases, a high degree of relationship between the assessors' "independent" ratings. An important requirement of an effective applicant appraisal method is that it possess substantial interrater reliability; very high interrater reliability has been demonstrated for the assessment center in this study.

TABLE 1

Interrater Reliability Estimates

Day     Skill     Team 1   Team 2   Team 3A   Team 3B   Across Teams(a)
1       1           .85      .87      .95       --          .87
        2           .94     1.00      .80       --          .94
        4          1.00      .92      .90       --          .92
        5           .94      .96      .97       --          .96
        6           .86      .87     1.00       --          .87
        Overall    1.00     1.00     1.00       --         1.00
        n             6        5        4       --           15
2       1           .63     1.00       --     1.00          .87
        2           .94      .95       --     1.00          .95
        4           .87     1.00       --     1.00          .94
        5           .85      .64       --      .96          .75
        6          1.00     1.00       --     1.00         1.00
        Overall    1.00      .94       --     1.00          .97
        n             6        4       --        5           10
3       1           .79      .96      .95       --          .95
        2           .80      .87     1.00       --          .87
        4           .85     1.00      .94       --          .94
        5           .95     1.00      .92       --          .95
        6           .32     1.00      .93       --          .93
        Overall    1.00      .96      .84       --          .96
        n             6        6        6       --           18
4       1           .92      .94       --     1.00          .93
        2           .98      .97       --     1.00          .93
        4           .87      .91       --     1.00          .89
        5           .96      .96       --     1.00          .96
        6           .94      .93       --     1.00          .94
        Overall    1.00      .56       --     1.00          .78
        n             5        6       --        4           11
5       1           .86     1.00      .94       --          .94
        2           .94      .89      .97       --          .94
        4           .96     1.00      .94       --          .96
        5           .98      .99      .96       --          .98
        6           .94      .90      .95       --          .94
        Overall     .94      .95      .91       --          .94
        n             6        6        6       --           18
Across  1           .89      .95      .95     1.00          .95
Days    2           .95      .91      .97     1.00          .95
        4           .93      .98      .93     1.00          .93
        5           .92      .95      .93      .96          .93
        6           .95      .93      .95     1.00          .95
        Overall     .98      .87      .95     1.00          .95
        n            29       27       16        9           72(b)

Note. n = number of candidates.
(a) Median (or mean) reliabilities were determined without Team 3B included.
(b) One candidate was evaluated individually; therefore the total does not reach 73.

The Relative Importance of the Skill Ratings in Determining the Overall Assessment Center Rating

The extent to which each skill rating contributed to the overall assessment center rating was determined by inspection of the beta weights, computed by using the skill ratings as predictors in a multiple regression equation with the overall rating as the criterion or dependent variable. Darlington (1968) has stated that such weights may be used as estimates of the "importance" of causal relationships. McNemar (1969) has also stated that the magnitude of squared beta weights may be used as estimates of relative importance.

Stepwise multiple regression, which indicates only those skills which contributed significantly to the prediction of the overall rating, was conducted. The F ratio required for entering each variable was arbitrarily set at 1.5 and the F ratio to delete at 1.0. All six skills were entered; however, F tests for significant increases in R² showed that the last skill to be entered, Skill 4, did not contribute significantly to increased predictability. Table 2 gives the multiple R and R² at each step, the beta weights in the final step, the simple correlations of each skill with the dependent variable, and the partial correlations (with the dependent variable) in the final step. Rank ordering from largest to smallest beta weight, the skills were: 2, 5, 1, 6, 3, and 4. It should be noted that the use of the simple correlations between each skill and the overall rating would not necessarily result in the same rank order as that produced by the beta weights, since for bivariate correlations no consideration is given to the interrelationships among the independent variables themselves.
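The stepwise procedure itself can be sketched as follows. This is a hedged reconstruction rather than the program actually used: R² is computed by ordinary least squares, a variable enters when its F-to-enter exceeds 1.5, and an entered variable is deleted if its F-to-remove falls below 1.0. The data generated at the bottom are hypothetical placeholders.

```python
# Hedged sketch of forward stepwise regression with F-to-enter = 1.5 and
# F-to-delete = 1.0, as described in the text.  Not the original analysis.
import numpy as np

def r2(X, y):
    """R-squared of an ordinary least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X]) if X.shape[1] else np.ones((len(y), 1))
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    e = y - X1 @ b
    return 1.0 - (e @ e) / (((y - y.mean()) ** 2).sum())

def f_change(r2_full, r2_reduced, n, k_full):
    """F ratio for the change in R-squared from adding or removing one predictor."""
    return (r2_full - r2_reduced) / ((1.0 - r2_full) / (n - k_full - 1))

def stepwise(X, y, f_enter=1.5, f_remove=1.0):
    """Return the entry order (column indices) of the selected predictors."""
    n, p = X.shape
    chosen = []
    while True:
        remaining = [j for j in range(p) if j not in chosen]
        base = r2(X[:, chosen], y)
        trials = [(f_change(r2(X[:, chosen + [j]], y), base, n, len(chosen) + 1), j)
                  for j in remaining]
        if not trials or max(trials)[0] < f_enter:
            return chosen
        chosen.append(max(trials)[1])
        # Delete any previously entered variable whose F-to-remove is now < 1.0.
        for j in list(chosen[:-1]):
            rest = [k for k in chosen if k != j]
            if f_change(r2(X[:, chosen], y), r2(X[:, rest], y), n, len(chosen)) < f_remove:
                chosen.remove(j)

# Hypothetical data: six skill ratings as predictors, overall rating as criterion.
rng = np.random.default_rng(0)
X = rng.integers(1, 8, size=(82, 6)).astype(float)
y = X @ np.array([.40, .10, .10, .10, .25, .15]) + rng.normal(0, .5, size=82)
print("entry order:", stepwise(X, y))
```

With the report's data, this kind of procedure entered Skill 2 first (as a sole predictor it gives R² = .7987, the roughly 80% of variance noted below) and Skill 4 last.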
The use of beta weights allows one to estimate the unique or independent contribution of a particular skill to the overall rating. In terms of prediction, the larger a beta weight, the more that variable adds to the predictability of the dependent variable. For example, increasing Skill 2 by 1 unit increases the overall rating by .4159 units; increasing Skill 6 by 1 unit increases the overall rating by .1423 units. No determination could be made, by the use of this technique, as to the apparent weight the assessor teams attached to the various skills. Beta weights are based upon correlations among independent variables and correlations between independent variables and the dependent variable, thus obscuring any judgmental or clinical weighting. Nevertheless, the results of this analysis show that Skill 2--"the ability to look at all possible courses of action and make appropriate decisions"--supplied the most important information about a candidate's overall assessment center rating. In fact, this skill, when used as a sole predictor, accounted for 80% of the variance of the overall rating. It also had the highest partial correlation with the overall rating in the final step, r = .61. While the other five skills correlated highly and significantly (p < .001) with the overall rating, due to relatively high intercorrelations among skills only four of the five added significant variance. Some of the information these skills added was redundant--already supplied by Skill 2. In this particular context, it was apparently not necessary to measure Skill 4 in order to arrive at an overall rating. This is not to say that Skill 4 is not important or relevant for assessment, since the relationship of the overall rating to job success has not yet been empirically demonstrated. Furthermore, the measurement of Skill 4 would be necessary in order to provide candidates with individual feedback for development purposes.

TABLE 2

Stepwise Multiple Regression Analysis, Prediction of the Overall Assessment Center Rating

Step   Skill     Multiple      R²      Beta Weight     Simple r with        Partial r
       Entered      R                  in Final Step   Dependent Variable   in Final Step
1      2          .8937      .7987       .4159              .8937              .6105
2      5          .9334      .8711       .2458              .7538              .5357
3      6          .9491      .9007       .1423              .7607              .2910
4      3          .9546      .9113       .1093              .5869              .2967
5      1          .9573      .9163       .1490              .8574              .2520
6      4          .9591      .9198       .0899              .6975              .2048

The Ability to Differentiate Between Candidates

If the assessment center is to be an effective and valuable measurement device, it must be capable of differentiating between candidates. In the selection procedure, candidates were divided into two groups--those that were in the upper 25% and those in the lower 75% on the central register. Differences in ratings between these groups on each skill for each method (assessment center, supervisor, and self-ratings) were compared by means of one-way analysis of variance.

Sample sizes, means, and standard deviations of ratings for the total group, for the upper 25%, and for the lower 75% appear in Table 3. The last column of the table shows the corresponding F ratio for each group comparison.
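Each F ratio in Table 3 comes from a one-way analysis of variance with two groups. A minimal sketch follows, assuming scipy and hypothetical ratings; with only two groups, the F ratio is simply the square of the corresponding t statistic.

```python
# Sketch of one group comparison from Table 3; the ratings are hypothetical.
from scipy.stats import f_oneway

upper_25 = [5, 6, 4, 5, 6, 5, 4]           # skill ratings, upper 25% group
lower_75 = [3, 2, 4, 3, 2, 3, 4, 2, 3]     # skill ratings, lower 75% group

f_ratio, p_value = f_oneway(upper_25, lower_75)
print(f"F = {f_ratio:.2f}, p = {p_value:.4f}")
```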
TABLE 3

Comparisons of Candidates in the Upper 25% and in the Lower 75% on Skill Ratings

                            Total Group        Upper 25%          Lower 75%
Method       Skill         N    M     SD     N    M     SD     N    M     SD    F Ratio
Assessment   1            82  3.06  1.49    23  4.57  1.31    59  2.47  1.10   53.40***
Center       2            82  3.26  1.68    23  5.04  1.40    59  2.56  1.21   64.08***
             3            82  1.91  1.70    23  3.26  2.14    59  1.39  1.15   26.26***
             4            82  4.74  1.48    23  6.22   .80    59  4.17  1.28   51.31***
             5            82  4.37  1.33    23  5.43  1.20    59  3.95  1.14   27.44***
             6            81  3.83  1.72    23  5.57  1.08    58  3.14  1.42   54.54***
             Overall      82  3.72  1.46    23  5.30  1.06    59  3.10  1.08   69.62***
Supervisor   1            79  5.10  1.15    22  5.55  1.30    57  4.93  1.05    4.77*
Rating       2            73  5.21  1.12    22  5.77  1.07    51  4.96  1.06    9.02**
             3            61  5.05  1.07    18  5.56  1.10    43  4.84  1.00    6.20*
             4            82  5.38  1.20    23  5.65  1.11    59  5.27  1.23    1.67
             5            81  5.25  1.12    22  5.73  1.16    59  5.07  1.06    5.85*
             6            82  5.17  1.16    23  5.65  1.07    59  4.98  1.15    5.80*
             7            82  5.21  1.19    23  5.61  1.34    59  5.05  1.11    3.73
             8            82  5.32  1.12    23  5.74  1.18    59  5.15  1.06    4.74*
             9            82  5.51  1.60    23  5.91  1.78    59  5.36  1.52    2.02
Self         1            22  5.64  1.26    14  6.00  1.04     8  5.00  1.41    3.64
Rating       2            22  6.41   .96    14  6.14  1.10     8  6.88   .35    3.29
             3            22  5.45  1.06    14  5.64  1.01     8  5.13  1.13    1.24
             4            22  6.45   .86    14  6.57   .76     8  6.25  1.04     .70
             5            22  5.50  1.30    14  5.43  1.28     8  5.63  1.41     .11
             6            21  6.43   .87    13  6.31   .95     8  6.63   .74     .65
             7            22  6.05  1.05    14  6.00  1.11     8  6.13   .99     .07
             8            22  6.14   .99    14  6.14  1.03     8  6.13   .99     .00
             9            22  5.86  1.39    14  6.07  1.38     8  5.50  1.41     .85

* p < .05   ** p < .01   *** p < .001

The probability of Type I error was inflated because of the multiple comparisons involved, the unequal sample sizes and the possible lack of homogeneity of variance. According to Myers (1972), however, this error rate can be deflated if there is a positive relationship between sample size and variance. For example, for Skill 4, as measured in the assessment center, the lower 75% group had the larger sample size (n = 59 as compared to n = 23) and the larger variance (s² = 1.64 as compared to s² = .64); therefore, the probability of Type I error was reduced. For approximately one third of the comparisons in Table 3, this positive relationship existed. However, since the probability of Type I error still remained high, the acceptable alpha level was set to .01.

The data in Table 3 show that:

- For comparisons on each of the skills measured in the assessment center, the levels of significance were all beyond .001. The skill ratings were significantly higher for the upper 25% than for the lower 75% group.

- There was only one F ratio which was significant (Skill 2, p < .01) for the supervisor skill comparisons. The ratings on Skill 2 were significantly higher for the upper 25% group than for the lower 75% group.

- None of the F ratios was significant for the self-rating skill comparisons--there were no significant differences between the two groups (upper and lower) on the nine skill ratings. However, as there were only 14 candidates in the upper 25% group who had completed self-ratings and 8 in the lower group, the samples may not be representative.

Both supervisor and self-ratings failed to differentiate adequately between those candidates who placed in the upper 25% and those who placed in the lower 75% on the central register. This is not too surprising given the weight placed on assessment center ratings. What is interesting, though, is the fact that there were no significant differences found for Skills 7, 8, and 9 as measured by supervisor ratings. These three skills were to have received a weight of one third in the ranking of candidates. As it was, they probably received considerably less weight, and assessment center ratings received considerably more weight. Guilford and Fruchter (1973) pointed out that ratings are automatically weighted in proportion to the size of their variation. Because of their larger variances, assessment center ratings probably contributed proportionately more to the sum of the nine ratings, which was used to rank candidates on the central register.

To examine the degree of variability in the ratings, sign tests (Siegel, 1956) were employed to compare the standard deviations of assessment center ratings on Skills 1 through 6 with the standard deviations of both supervisor and self-ratings on the same skills. Table 4 shows, for all comparisons, that the standard deviations of assessment center ratings were larger than those of the other methods (p = .016). Restricted ranges were more evident in supervisor and self-ratings. That is, assessors tended to make greater use of the full range of possible ratings than did supervisors or candidates. These results and those reported earlier in this section demonstrate that the assessment center was better able to differentiate between candidates.

TABLE 4

Sign Tests for Means and Standard Deviations

                       Comparison     Direction
Mean                   AC vs SUPR     SUPR > AC*
                       AC vs SELF     SELF > AC*
Standard Deviation     AC vs SUPR     AC > SUPR*
                       AC vs SELF     AC > SELF*

Note. AC = assessment center ratings. SUPR = supervisor ratings. SELF = self-ratings.
* p = .016 (one-tailed test)

Objectivity in the Assessment Center Process

The fourth question this study was to answer was whether the assessment center process was objective in evaluating candidates.

Leniency Error

In order to investigate whether leniency error--the tendency to use the high end of the scale exclusively--existed, sign tests were used to compare the means and standard deviations of assessment center ratings with the means and standard deviations of both supervisor and self-ratings on the same skills.
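The sign test here reduces to simple binomial arithmetic: if all six skill comparisons fall in the same direction, the one-tailed probability under a 50/50 null hypothesis is (1/2)^6 = .0156, the p = .016 reported in Table 4. A minimal sketch, assuming scipy and using the total-group standard deviations from Table 3:

```python
# Sign test sketch: do assessment center SDs exceed supervisor SDs on Skills 1-6?
from scipy.stats import binomtest

ac_sd   = [1.49, 1.68, 1.70, 1.48, 1.33, 1.72]   # total group, Table 3
supr_sd = [1.15, 1.12, 1.07, 1.20, 1.12, 1.16]   # total group, Table 3

wins = sum(a > s for a, s in zip(ac_sd, supr_sd))
p = binomtest(wins, n=len(ac_sd), p=0.5, alternative="greater").pvalue
print(f"{wins} of {len(ac_sd)} in the predicted direction, p = {p:.3f}")  # p = 0.016
```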
As can be seen from Table 4, means on all assessment center skill ratings were smaller (p = .016) than those on supervisor or self-ratings. Standard deviations for all assessment center ratings were larger (p = .016) than those for supervisor and self-ratings. In addition, for five out of six skills, assessment center ratings were closer to the mid-point of the rating scale than were either supervisor ratings or self-ratings. The size of the standard deviations of assessment center ratings indicates that these were not central tendency errors. These findings indicate that supervisors tended to rate candidates consistently higher than did assessors on Skills 1 through 6. Likewise, candidates' ratings of themselves were higher than assessor ratings. Leniency error, a major problem in discriminating among candidates for selection purposes, was not found in the assessment center ratings in this study.

Other Possible Sources of Error

Other possible sources of error were examined by (a) determining whether there were significant differences between candidates in the upper 25% and in the lower 75% on biographical variables such as sex, age, time at BEP, time in grade, and grade level, and by (b) determining whether there were significant correlations between skill ratings and sex, age, time at BEP, time in grade, and grade level.

Comparisons between upper and lower groups on biographical data. Table 5 includes sample sizes, means, and standard deviations for the total group, the upper 25% and the lower 75% on the following variables: sex (males were coded "1", females, "2"), age (in years), time at BEP (in years), time in grade (in years) and grade level.
(Since there were essentially three categories of grade--General Schedule (GS), Wage Grade (WG), and Printing and Lithographic (WP)--grade level was analyzed separately for each.) There were no statistically significant differences between the upper and lower groups on any of the aforementioned variables, except for those in the GS category. Those GS candidates in the upper group had significantly higher (p < .05) grades than those in the lower group. For reasons cited earlier--the unequal sample sizes and the possible heterogeneity of variance--the increased probability of a Type I error may cloud the results of this finding.

TABLE 5

Comparisons of Candidates in the Upper 25% and in the Lower 75% on Selected Biographical Variables

                        Total Group           Upper 25%             Lower 75%
Variable             N     M      SD       N     M      SD       N     M      SD     F Ratio
Sex(a)              82   1.70    .46      23   1.70    .47      59   1.69    .46      .00
Age(b)              81  35.67  11.22      22  35.55  11.79      59  35.71  11.10      .00
Time at BEP(b)      81   7.69   6.92      22   7.82   7.85      59   7.64   6.61      .01
Time in Grade(b)    81   3.25   2.17      22   3.68   2.53      59   3.08   2.01     1.22
Grade: GS           53   3.87   1.21      14   4.43   1.40      39   3.67   1.08     4.35*
Grade: WG           26   3.00   1.41       8   3.13   1.55      18   2.94   1.39      .09
Grade: WP            2   6.50   3.54       0   0.00   0.00       2   6.50   3.54      --

Note. GS = General Schedule. WG = Wage Grade. WP = Printing and Lithographic.
(a) Males were coded "1", females, "2".
(b) In years.
* p < .05

Relationships between skill ratings and biographical data. Correlations for each method between skill ratings and each of the following variables are presented in Table 6: sex, age, time at BEP, time in grade, and grade level for WG and GS candidates. As evidenced in Table 6:

- There are a number of significant and positive correlations between skill ratings and GS grade level for assessment center ratings. Two correlations were significant beyond the .001 level (Skill 2 and the overall rating), two were significant beyond the .01 level (Skills 1 and 3), and one beyond the .05 level (Skill 5). There were no significant correlations between assessment center ratings and sex, age, time at BEP, time in grade, or WG grade level.

- Correlations between supervisor ratings on Skills 1, 4, 5, and 6 and time at BEP were negative and significant (p < .05). This indicates a tendency for candidates who have worked longer at BEP to be rated lower by their supervisors. The correlation between supervisor ratings on Skill 4 and age was also negative and significant at the .05 level, indicating a tendency for older candidates to be rated lower. Skill 3 was positively correlated with GS grade level for supervisor ratings (p < .05).

- For self-ratings, there was a significant and positive correlation between Skill 7 and sex. Female candidates rated themselves significantly higher on this skill than did males. Skill 7 was also positively correlated with time at BEP (p < .05), demonstrating a tendency for candidates who had worked at BEP longer to rate themselves significantly higher. Skill 9 was negatively correlated with time in grade (p < .05). This indicated that candidates who had more time in grade tended to rate themselves significantly lower on this particular skill. As far as grade level was concerned, higher graded GS candidates rated themselves significantly higher on Skill 3 (p < .05), and higher graded WG candidates rated themselves significantly higher on Skill 7 (p < .001).

TABLE 6

Correlations Between Selected Biographical Variables and Skill Ratings

                                       Time      Time         Grade
Method       Skill     Sex     Age    at BEP   in Grade     GS       WG
Assessment   1        -.04    -.05     .06      .09        .44**     .09
Center       2        -.09    -.15     .02      .01        .47***    .01
             3        -.05    -.03     .09      .10        .41**    -.06
             4        -.15    -.05    -.13      .01        .15       .09
             5        -.02     .05     .07      .12        .33*      .27
             6        -.08    -.10    -.09     -.02        .22       .03
             Overall  -.13    -.09     .03      .05        .46***    .07
Supervisor   1         .08    -.21    -.26*    -.11        .11      -.12
Rating       2         .17    -.09    -.17     -.09        .10       .00
             3         .06    -.18    -.20     -.16        .34*      .23
             4        -.01    -.23*   -.25*    -.18        .03      -.29
             5         .02    -.22    -.22*    -.19        .08      -.29
             6         .17    -.17    -.26*    -.17        .08       .03
             7         .20    -.04    -.09     -.15        .14       .03
             8         .00    -.21    -.22     -.06        .20      -.13
             9         .18     .15     .01     -.19       -.04      -.13
Self         1         .07    -.29    -.12      .13        .39      -.32
Rating       2         .38    -.03     .21     -.08        .19       .00
             3         .07    -.21    -.05      .17        .59*     -.32
             4         .09     .21     .15      .07        .12       .26
             5         .00    -.38     .15     -.33        .02      -.53
             6         .15    -.04     .29      .21        .24       .10
             7         .53*    .14     .43*     .29        .37      1.00***
             8         .40     .03    -.09      .13        .36       .25
             9         .05     .20    -.33     -.53*       .00      -.63

* p < .05   ** p < .01   *** p < .001

Tests of differences (t tests) between correlations were conducted to determine if the correlations of the assessment center ratings with time at BEP and age were significantly different from those for supervisor ratings. This analysis revealed that only two correlations, those for Skills 1 and 5 on time at BEP, were significantly different (p < .05).

Discussion. Several interesting relationships existed between GS grade level and assessment center skill ratings, as well as between group membership--upper 25% or lower 75%--and GS grade level. Since assessors were not aware of candidates' grade levels, and since neither age itself nor experience (time at BEP) had a relationship with skill ratings, perhaps some other factor, such as competency, was indicated by grade level. If grade level reflects competency, then the higher a candidate's GS grade level, the higher his/her level of ability. This same relationship did not emerge between WG grade level and assessment center ratings, however. Apparently, the same factors were not reflected by WG grade level as by GS grade level. Given a larger sample size and a normal distribution of grade level, perhaps significant correlations would have been found for WG candidates.

While higher level GS candidates received higher assessment center ratings on some skills, the assessment center did not favor GS candidates over WG candidates as a group. The correlation between grade category (GS or WG) and group membership (upper 25% or lower 75%) was practically zero (r = .02). It was not clear whether the significant relationships between assessment center skills and GS grade level reflected competency or another factor.

There were no significant correlations between assessment center ratings on any of the six skills and either age or time at BEP. However, for these six skills, supervisor ratings showed that one skill rating (Skill 4) was negatively related to age at a significant level, and four of the six skill ratings (Skills 1, 4, 5, and 6) were negatively and significantly related to time at BEP, demonstrating potential sources of rater error. While t tests for correlated data indicated that only two of these five significant correlations for supervisor ratings were significantly greater than the corresponding assessment center coefficients, it is important to note that these results (applying an a priori alpha level of .05) indicated potential rater error for supervisor ratings, but not for assessment center ratings.
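The report does not state which formula was used for these t tests between dependent correlations. One classical choice, shown here purely as an illustration with hypothetical r values, is Hotelling's t for two correlations that share a variable (here, age correlated with both the supervisor rating and the assessment center rating):

```python
# Illustrative only: Hotelling's t for comparing two dependent correlations.
# The report's exact formula is unknown; the r values below are hypothetical.
import math

def hotelling_t(r12, r13, r23, n):
    """t statistic (df = n - 3) for H0: rho12 = rho13, with variable 1 shared."""
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    return (r12 - r13) * math.sqrt((n - 3) * (1 + r23) / (2 * det))

# r12: age with supervisor rating; r13: age with AC rating; r23: the two ratings.
print(round(hotelling_t(r12=-0.23, r13=-0.05, r23=0.15, n=81), 2))
```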
Relationships Between Different Methods for the Same Skills

To establish whether the assessment center supplied information about candidates which supervisor or self-ratings did not, correlations between the same skills as measured by different methods were examined. To the extent that assessment center ratings correlate significantly with either supervisor or self-ratings on the same skills, the assessment center is not contributing unique information about possible future job performance (assuming all methods are equally reliable and valid). Table 7 shows correlations between the same skills for each method combination, i.e., assessment center with supervisor ratings, assessment center with self-ratings, and supervisor with self-ratings.

TABLE 7

Relationships Between Different Methods for the Same Skills

Skill    AC vs SUPR    AC vs SELF    SUPR vs SELF
1           .08           .59**          .51*
2           .21           .00            .22
3           .32*          .53*           .48*
4           .12           .03            .17
5           .12           .13            .40
6           .15           .17            .18

Note. AC = Assessment center rating. SUPR = Supervisor rating. SELF = Self-rating.
* p < .05   ** p < .01

The correlations for Skill 1 were significant between assessment center and supervisor ratings (p < .01) and between supervisor and self-ratings (p < .05). All the correlations between different methods on Skill 3 were significant (p < .05). Except for these findings, skills do not appear to be independent of method. With the exceptions just noted, these data support the hypothesis that the assessment center can provide information that is not contributed by either supervisor or self-ratings of current job performance. The assessment center can therefore supply unique information, on the same skills, about a candidate's probable performance on a target job.

Summary and Conclusion

The Civil Service Commission, as requested by BEP, developed an assessment center and trained assessors in support of BEP's Upward Mobility Program. Seven target positions were identified, and a job analysis revealed that there were nine skills which were critical to success in all seven jobs. Candidates participated in a four-hour assessment center designed to evaluate them on six of the nine skills. Trained assessors observed 82 candidates in three job-related exercises. The agency also obtained supervisor ratings and used this information in conjunction with assessment center ratings to rank all the candidates on a central register. Candidates in the upper 25% were then assigned to one or more of seven different registers (one for each target job), depending upon how well they matched position requirements.

Interrater reliabilities were determined for assessors by teams, days, across days, across teams, and across teams and days. Considerable agreement was found between pairs of raters. The results verify findings from other studies. Certain factors were recognized which might have led to the high reliabilities found here.

An examination of the beta weights for assessment center skills revealed that certain skills contributed more to the overall assessment center rating than did others, particularly Skill 2. While four other assessment center skills contributed significantly to the prediction of the overall rating, some of this information was redundant.

Greater differentiation between candidates was afforded by the assessment center than by the other methods.
While ratings by supervisors on three skills, which were used to rank candidates for selection purposes, did not significantly differentiate between those candidates in the upper 25% on the selection register and those in the lower 75%, assessment center ratings were able to differentiate significantly between candidates (p < .001). Supervisor and self-ratings were more restricted in range and subject to leniency errors than were assessment center ratings.

There were tendencies for supervisor ratings of some skills to be affected by a candidate's age and length of time at BEP. Assessment center ratings appeared to be free of these errors, but were not significantly different from supervisor ratings except in two instances. Assessment center ratings were highly correlated with GS grade level.

Considering the available evidence, the assessment center reported in this study seems to be a reliable and objective method which proved to be an extremely valuable technique for differentiating between candidates. The assessment center can be a very useful tool for evaluating candidates in an upward mobility situation where performance on a target job cannot be satisfactorily measured by other methods.

This research raises several important questions that suggest the following additional studies:

- The exact relationship between grade and assessment center performance should be further examined.

- A long-term study should be conducted in order to determine the predictive effectiveness of the assessment center approach in this context.

- A determination should be made of how much weight an assessor places on a particular skill when arriving at an overall rating and whether, in fact, unit weights would serve just as well.

REFERENCES

Bray, D. W., & Grant, D. L. The assessment center in the measurement of potential for business management. Psychological Monographs, 1966, 80.

Darlington, R. B. Multiple regression in psychological research and practice. Psychological Bulletin, 1968, 69, 161-182.

APPENDIX D

SUPERVISORY APPRAISAL-UPWARD MOBILITY PROGRAM (FORM 2224)

... can independently identify and assimilate other data that may aid in an easier and faster approach to work accomplishment.

JUDGMENT-Ability to look at all possible courses of action and make appropriate decisions.
- Unable to make decisions appropriate to the situation.
- Decisions at times are less than appropriate to the situation.
- Makes decisions consistently appropriate to the situation but does not always consider alternatives.
- Looks at various courses of action and makes the best possible judgment appropriate to the situation.

INITIATIVE-Self-starting; follows through on work assignments.
- Requires constant prodding to complete work assignments.
- Requires some prodding to complete assignments.
- Completes work assignments without prodding and at times initiates action to overcome obstacles and resolve problems.
- Completes work assignments without prodding and always initiates action to overcome obstacles and resolve problems.

ARITHMETIC ABILITY-Ability to accurately solve job-related mathematical problems.
- Has great difficulty in handling the simplest problems, e.g., multiplication, division, etc.
- Experiences slight difficulty in solving simple arithmetic problems.
- Has the ability to solve all simple arithmetic problems as well as some more difficult problems, e.g., percentages.
- Outstanding arithmetic ability which leads to the completion of difficult and complex problems.

FORM 2224 (Page 2)
SUPERVISORY APPRAISAL-UPWARD MOBILITY PROGRAM

- ... along with others nor work effectively with them.
- Experiences some difficulty in getting along with others and working effectively with them.
- Gets along well with others and usually responds in a thoughtful manner to others.
- Always aware of how his/her actions affect others. Gets along excellently with others, is very tactful and always considerate of the other person's point of view.

ORAL COMMUNICATIONS-Ability to express ideas clearly, logically and in the correct grammatical form.
- Unable to communicate in a clear, understandable manner.
- Can communicate in a fairly clear and understandable manner but experiences some grammatical difficulty.
- Communicates in a clear and understandable manner and experiences little grammatical difficulty.
- Excellent ability in expressing ideas clearly, logically and in the proper grammatical form.

CONSISTENCY-Quality of work performance is consistent and predictable in varying situations.
- Quality of work is not consistent and predictable.
- Quality of work is not always consistent and predictable.
- Quality of consistency and predictability on the job is good even in varying situations.
- Quality of consistency and predictability on the job is outstanding despite the complexity of the situation.

FLEXIBILITY-Ability to adjust to changes in varying work situations.
- Unable to adjust to changes in varying work demands.
- Can, but with some difficulty, adjust to changes in varying work demands.
- Can usually adjust rapidly to changes in varying work demands.
- Exceptional ability to rapidly adjust to continual changes in work demands.

DEPENDABILITY-Punctual; regularly stays at job site except during periods of excused absence or leave.
- Seldom punctual or at job site when needed.
- Occasionally not at job site when needed and at times not punctual.
- Rarely late and can be counted on to be at his/her job site when needed.
- Always punctual and never unnecessarily leaves job site.

EVALUATOR (Print or type full name)          SIGNATURE (Evaluator)          DATE