College and Research Libraries, March 1987
Research Notes

Ratings and Rankings: Multiple Comparisons of Mean Ratings

William E. McGrath

William E. McGrath is Associate Professor in the School of Information and Library Studies, State University of New York at Buffalo, New York 14260.

Ranking of journals or other objects according to mean ratings computed from an opinion survey is shown to be inappropriate if a test of significance shows no difference between them. A Scheffé test for comparisons of mean ratings of journals ranked by Kohl and Davis [C&RL 46:40-47 (Jan. 1985)] was performed. The results indicate no significant difference between means. Confidence intervals for every adjacent pair of journals in the list of ratings by ARL directors were also computed. The results indicate that every adjacent interval overlaps, and that the means are essentially tie scores. Treating them as significantly different, therefore, is a Type 1 error.

Rank ordering of mean ratings, a common practice in library science research, can lead to serious Type 1 errors if the mean ratings are not first submitted to tests of significance. "Type 1" errors are those in which a hypothesis assuming no difference between two means, say, is actually true but is treated as untrue by the researcher. In turn, Type 1 errors, if not recognized, may lead to unjustified social or administrative actions or other errors of judgment or policy.

Two examples will illustrate. The first is from my own research some years ago, which inconclusively attempted to correlate mean ratings of subject-area characteristics (computed from a 10-point scale) with variables of library circulation.¹ The absence of strong correlations may be attributed to the probable absence of significant differences between the mean ratings of subject areas. Had those differences been tested, the limitations of my design might have been realized. Fortunately, the long-term consequences were as negligible as the correlation, as I had merely failed to build good theory.

The second example appears in an article by Kohl and Davis.² These authors asked ARL library directors and deans of accredited library schools to rate thirty-one library journals in terms of their importance to evaluations of publications by librarians or faculty being considered for promotion and tenure. Each journal title was rated by each respondent on a 5-point Likert scale. The authors computed the mean rating of each journal, then ranked the journals according to these means. As in my own research, the authors did not test to determine whether means were significantly different from each other, although they did compare directors' ratings to deans' ratings. Without such a test there is no evidence that one mean rating is any different from any other.

The rankings in question appear in their table 1.³ These ranks seem to assume that each mean is different (for example, that the mean for Library Quarterly, 4.4048, is different from that for Journal of Academic Librarianship, 4.3810) when in fact they are probably not different. That is precisely the same error cited in the first example: a Type 1 error.

Kohl and Davis, however, did seek to avoid Type 1 errors, first by performing t-tests for the differences between the means of ARL directors and library school deans, then by looking at internal consensus. They report the results of that test in their table 2. They conclude that, because deans and directors appear to agree on their ratings of journal "importance," there is a "perceived hierarchy of journal prestige."

However, their Type 1 errors are between journals, not between deans and directors. Thus, their finding of a "perceived hierarchy of journal prestige" is not supported. Although a perceived hierarchy may exist, it cannot be determined from their table 1.
Therefore, acceptance of these journal ranks at face value for the purpose of determining promotion and tenure of librarians and faculty could lead to inappropriate evaluation.

The small visual differences between the means in table 1 and the small sample size from which the journal means were computed also cast suspicion on conclusions drawn from them. The Scheffé test is appropriate for all possible comparisons.⁴ The data reported in their tables 1 (mean ratings) and 2 (sample sizes and standard deviations) make it possible to compute an overall mean square within (MSw), which is required to compute an F statistic, which, in turn, is required to perform the test. The equation for F is

$$F = \frac{(M_1 - M_2)^2}{MS_w \left( \frac{1}{n_1} + \frac{1}{n_2} \right) (k - 1)}, \quad \text{with } df = k - 1,\ N - k. \tag{a}$$

Working backwards, it is possible to compute MSw from the statistics reported in table 3, as follows:

$$MS_w = \frac{\sum S_j^2 n_j - \sum S_j^2}{N - k}, \tag{b}$$

where $S_j^2$ and $n_j$ are the squares of the standard deviations and the sample sizes for each journal, respectively, $k$ is the number of journals, and $N$ is the total number of ratings. In (a), $M_1$ and $M_2$ are the two means being compared and $n_1$ and $n_2$ are their sample sizes. A sample size of 42 for each journal, reported in Kohl and Davis' table 3, is assumed in computing the above equations.

The Scheffé test was performed on means of ARL directors' ratings (left column of table 1) but only for the journals in Kohl and Davis' table 3, which contains the standard deviations necessary for the computation. From (b) above, MSw = 2.23. This value was used in (a) to compute F values for the Scheffé tests appearing in table A.

For no adjacent pair of journals did the computed values of F exceed the test value of 1.57, indicating true null hypotheses in every comparison, i.e., that the means for every adjacent pair in the list are not significantly different from each other. Not until the journal at the top of the rankings, College & Research Libraries, was compared with one well down in the list, namely Library and Information Science Research, was a significant difference observed.
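As an illustrative check, equations (a) and (b) can be sketched in Python as follows. The function and variable names are assumptions made for this sketch only. The means, the value MSw = 2.23, the sample size of 42, and the critical value of 1.57 are those reported in the text and in table A. The standard deviations in Kohl and Davis' table 3 are not reproduced here, so equation (b) appears only as a function and is not applied to data.

```python
# Illustrative sketch of equations (a) and (b). All names are hypothetical;
# only the numeric inputs come from the text and table A.

def pooled_ms_within(variances, sizes):
    """Equation (b): (sum(S_j^2 * n_j) - sum(S_j^2)) / (N - k)."""
    k = len(variances)            # number of journals
    total_n = sum(sizes)          # N, total number of ratings
    weighted = sum(v * n for v, n in zip(variances, sizes))
    return (weighted - sum(variances)) / (total_n - k)


def scheffe_f(m1, m2, n1, n2, ms_within, k):
    """Equation (a): F for the difference between two of k means."""
    return (m1 - m2) ** 2 / (ms_within * (1 / n1 + 1 / n2) * (k - 1))


MS_WITHIN = 2.23      # reported result of equation (b)
N_PER_JOURNAL = 42    # sample size assumed for every journal
CRITICAL_F = 1.57     # .05 level, df = 19 and 820

# Mean ratings by ARL directors, in the order listed in table A.
ARL_MEANS = [
    4.7381, 4.4048, 4.3810, 4.3810, 4.2381, 4.1429, 4.0952, 3.8571,
    3.5000, 3.3810, 3.1667, 2.8810, 2.5238, 1.9286, 1.7381, 1.5714,
    1.5714, 1.5714, 1.5476, 1.5238,
]
K = len(ARL_MEANS)

# F for each title paired with the one immediately following it.
adjacent_f = [
    scheffe_f(a, b, N_PER_JOURNAL, N_PER_JOURNAL, MS_WITHIN, K)
    for a, b in zip(ARL_MEANS, ARL_MEANS[1:])
]

# First title (C&RL) against the twelfth (Libr. & Info. Sci. Res.).
top_vs_twelfth = scheffe_f(
    ARL_MEANS[0], ARL_MEANS[11], N_PER_JOURNAL, N_PER_JOURNAL, MS_WITHIN, K
)

print(round(adjacent_f[0], 2))        # 0.06
print(max(adjacent_f) < CRITICAL_F)   # True
print(round(top_vs_twelfth, 2))       # 1.71
```

Under these assumptions the sketch gives 0.06 for the first adjacent pair and 1.71 for the comparison of the first title with Library and Information Science Research, the same values that appear in table A and its note.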
Furthermore, Library and Information Science Research is not significantly different from the journals following it in the list.

This general lack of significance does not appear to support the rationale for strict ranking of these journals. At best, one might postulate two clusters of journals, with each journal in the first cluster essentially tied for first place and each in the second cluster tied for second place. To paraphrase Consumer Reports, journals within clusters are approximately equal in importance.

Nearly identical results were obtained when a t-test for independent samples (though these samples may not be truly independent) was performed, again working backward from the standard deviations to obtain sums of squares and standard errors of the differences between each pair of means.

Finally, confidence intervals for all means in the ARL directors' list were computed, again at the .05 significance level. For every journal, the confidence interval overlapped the one above it and below it. For example, the lower and upper limits for C&RL were 4.60 and 4.87, respectively, while the lower and upper limits for LQ were 4.09 and 4.72. Clearly, the upper limit of LQ falls well within the interval for C&RL, indicating that their means cannot be distinguished from each other.

TABLE A
SCHEFFÉ TEST FOR DIFFERENCES BETWEEN PAIRS AND CLUSTERS OF JOURNAL MEANS

Journal Title                Mean     Pair-wise F value*   Possible Clusters+
Coll. & Res. Libr.           4.7381   0.06                 Possible Cluster 1
Libr. Quart.                 4.4048   0.00                 Possible Cluster 1
J. Acad. Libr.               4.3810   0.00                 Possible Cluster 1
Libr. Res. & Tech. Serv.     4.3810   0.01                 Possible Cluster 1
Library Trends               4.2381   0.00                 Possible Cluster 1
Info. Tech. and Libr.        4.1429   0.00                 Possible Cluster 1
J ASIS                       4.0952   0.03                 Possible Cluster 1
Library Journal              3.8571   0.06                 Possible Cluster 1
American Libraries           3.5000   0.01                 Possible Cluster 1
RQ                           3.3810   0.02                 Possible Cluster 1
Special Libraries            3.1667   0.04                 Possible Cluster 1
Libr. & Info. Sci. Res.      2.8810   0.06                 Possible Cluster 2
Collect. Management          2.5238   0.18                 Possible Cluster 2
Info. Proc. & Mgmnt.         1.9286   0.02                 Possible Cluster 2
School Library Journal       1.7381   0.01                 Possible Cluster 2
Intern. Libr. Rev.           1.5714   0.00                 Possible Cluster 2
Micrographics Today          1.5714   0.00                 Possible Cluster 2
School Library Media Q       1.5714   0.00                 Possible Cluster 2
Intern. J. Law Libraries     1.5476   0.00                 Possible Cluster 2
Law Library Journal          1.5238   xxxx                 Possible Cluster 2

*F(df: k - 1 = 19, N - k = 820), .05 level = 1.57. The F value refers to pairs of titles: the title listed and the one immediately following. Thus, the first F listed, 0.06, refers to College and Research Libraries and Library Quarterly. F values must exceed 1.57 to be significant. None are.

+Means for journals within "possible clusters" are not significantly different from each other. But the first title in cluster 1 (C&RL) is significantly different [F(.05) = 1.71] from the first title in cluster 2 (Library and Information Science Research), while the last title in cluster 1 (Special Libraries) is significantly different (F = 1.71) from the last title in cluster 2 (Law Library Journal); clusters 1 and 2 overlap each other with Special Libraries. The difference between the average of cluster 1 and the average of cluster 2 is significant [F(.05) = 22.7].

Visual inspection of the means for library school deans' rankings (right column of table 1) suggests that few significant differences would be found between adjacent journals in that list either.

This analysis suggests that rankings of average ratings cannot be trusted unless the ratings are first submitted to appropriate tests of significance. Such tests are necessary even when data are trustworthy (for example, when the sample is large, or when it otherwise represents the population with a high degree of confidence). Here, a distinction should be made between performing tests of significance to guard against sampling errors on the one hand and measurement errors on the other.
Here, the rating scores can properly be considered as measurements subject to error. For example, an average score can hide a great diversity of opinion. If we ask 100 respondents to rate journals on a 1-to-5 scale, a particular journal could receive an average of 3.0 in several ways. At the extremes, all respondents could give the journal a rating of 3; or 50 respondents could give a rating of 1, and 50 a rating of 5. Both scenarios produce an average of 3.0, but the first represents exact consensus. In the second, the average score hides a considerable degree of measurement error. In fact, in the second scenario no individual respondent gives the journal a rating of 3.0, and we might well question whether a real consensus exists that a journal with a rating of 3.0 is really higher than one with a rating of 2.9.

Kohl and Davis sprinkle cautions throughout their study, noting that it has "important limitations" that must be considered "to maintain a proper perspective on the findings." Perhaps the major caution should address the use of these or similar ranks for determining tenure and promotion.

If journal prestige and importance must be studied, then many related questions, including those raised here and by Kohl and Davis, must also be studied. Which journals do the larger population of non-ARL directors and ACRL members feel are important? What is the relationship between a respondent's own specialized area and the subject area of the journal being rated? What are the correlates of "prestige" or "importance"? Can prestige or importance be predicted from other variables? What is the basis for equating prestige and importance? Is prestige a variable of real utility, or does it merely make an author feel good? Do studies of prestige contribute to the knowledge base of our profession? Or does the knowledge base contribute to prestige?

Prestige is not a guarantee of quality, say Kohl and Davis.
Likewise, quality is not a guarantee of prestige. Then what is quality, and what is the relationship between prestige and quality? Kohl and Davis suggest citation analysis; other kinds of impact should also be examined. It seems that whenever we attempt to measure attitudinal variables, we can never really pin them down without reference to behavioral variables. Understanding of behavioral variables has much the greater potential for contributing to good theory.

In conclusion, whenever rating scores are used to produce rankings of items being rated, those rankings should be subjected to appropriate tests of statistical significance.

REFERENCES AND NOTES

1. William E. McGrath, "Predicting Book Circulation by Subject in a University Library," Collection Management 1, no.3/4:7-26 (Fall/Winter 1976-77). Average ratings in this research were for the variables Hard/Soft, Pure/Applied, and Life/Nonlife.
2. David F. Kohl and Charles H. Davis, "Ratings of Journals by ARL Library Directors and Deans of Library and Information Science Schools," C&RL 46:40-47 (Jan. 1985).
3. All references to tables are to Kohl and Davis except for table A.
4. John T. Roscoe, Fundamental Research Statistics for the Behavioral Sciences (New York: Holt, 1975), p.313.

Authors' Reply

David F. Kohl and Charles H. Davis

We read William McGrath's comments on our study with considerable interest. Our only concern is that in order to make his point he has to make us say more than we were, in fact, comfortable saying. It frankly never occurred to us that anyone would take the listing in Table 1 as some kind of precise ranking where "each mean is different," since that is obviously not the case. Not only did a number of the journals listed in Table 1 have identical means (and were, in those cases, "ranked" in alphabetical order), but in addition we present two other possible "rankings" which vary in detail from the lists in Table 1.
The point of the article, which was fairly explicitly made, was not that any one journal stood in a specific relationship to any other journal, but that a clearly recognizable general pattern did exist, with some journals consistently emerging toward the top, others toward the middle, and others toward the bottom.

In fact, Professor McGrath's own analysis seems to confirm this general hierarchy or, as he calls it, clustering. It should be noted that he finds this very general clustering (into two groups) using the Scheffé test, the most conservative test of this kind possible. A less restrictive test such as the Duncan, Tukey, etc., would invariably have suggested finer distinctions among the journals. The issue, which McGrath's comments may obscure, is not whether there is or is not some hierarchy or ranked clustering but how fine the gradations of the hierarchy or clustering are.

We agree with McGrath's point that averages don't necessarily constitute a detailed ranking and hope that his comments may help prevent a misreading of Table 1 of our study by casual readers. We do feel, however, that his misinterpretation of Table 1 created a bit of a straw man in our case.

David F. Kohl is Assistant Director for Public Services, University of Colorado, Boulder, Colorado 80309. Charles H. Davis is Professor, Graduate School of Library and Information Science, University of Illinois, Urbana, Illinois 61801.