SOME WELL-KNOWN MENTAL TESTS EVALUATED AND COMPARED

BY DOROTHY RUTH MORGENTHAU

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, in the Faculty of Philosophy, Columbia University

REPRINTED FROM ARCHIVES OF PSYCHOLOGY
R. S. WOODWORTH, Editor
No. 52

NEW YORK
May, 1922

TABLE OF CONTENTS

Introduction
Subjects
Tests Briefly Described and Reasons for Their Selection
Method: Applying Tests to Subjects
    General Considerations
    Specific Observations on the Application of the Tests Selected
Results
Conclusion

ACKNOWLEDGMENT

For the advice of Professor Edward Lee Thorndike of Teachers College, Columbia University, and of Dr. William Healy and Dr. Augusta Bronner of Judge Baker Foundation, Boston, the writer wishes to express appreciation. Special thanks are due for the painstaking assistance given by Professor Robert S. Woodworth of Columbia University, New York City.

Some Well-Known Mental Tests Evaluated and Compared

ONE who approaches the subject of the measuring of children's mentality will find that the mind of the normal child has received attention in what we may call vertical and parallel respects. There have been a considerable number of tests developed by students of psychology in the endeavor to secure mental measurements independent of the experience and judgment of the clinician.
The development has been in a vertical manner, that is, the best recognized psychologists who have undertaken this work have each developed tests, have each put them into extensive practice and have published the results of that experience. But each of these psychologists has developed his test on his own suppositions, and, basing the nature of his test on his own experience, has tried to evolve a plan of testing which is supposed to be useful in determining mental conditions of such general extent that they may roughly be called intelligence. Thus we have the Stanford-Binet scale, the most generally used of any one of the mental tests. Then there are the Porteus tests, the Pintner-Paterson performance scale, and a dozen or more others which are known to every clinical psychologist.

The development of mental tests has been parallel in that none of these psychologists, in developing their own ideas, have carried them to the point of thoroughly comparing the results obtained by their tests with the results obtained by the simultaneous use of a number of the other tests, all with respect to normal children. There has been some comparison of results of the various tests when applied to abnormal children, but this has not been thoroughgoing and has been done not by making the tests with the idea of eventually combining the results and of placing valuations upon them, but merely in the course of clinical work with abnormal children. It is questionable whether such results are sufficiently thorough to be considered the basis for a convincing answer as to the relative value of the respective tests, and inasmuch as they were made on abnormal minds, one would not dare to trust even those comparative results with respect to what the test will show as to normal minds.

Those who have developed their respective tests have compared them with some other mental test, most frequently one of the Binet revisions.
But no considerable number of the tests which have been so developed in parallel fashion have been applied purposely to obtain comparative results and to ascertain which, if any, of them can be shown to be untrustworthy and what group of them can be relied upon as furnishing a satisfactory schedule for testing and comparing the common elements of mentality in normal children.

Upon perceiving that there was a lack of any purposely made comparative study of mental tests, it was proposed herein to set forth the results of such a study of about a dozen of the most commonly used mental tests. The tests were applied to a large number of unselected normal children, in general each child receiving the full schedule of tests. By means of the results to be obtained from this comparative study it was anticipated:

1. That the degree of reliability of each test would be indicated.

2. That the same purposes could be effected with respect to the value of each test.

3. That the information obtained under the first and second headings would make it possible to select a schedule of tests of indicated reliability for application to normal minds, or, further, to determine whether the Stanford-Binet alone would suffice.

4. That by restricting the ages of children tested in general to from ten to sixteen years, the period in which individual capacities first assume importance for vocational determination, it would be possible to guide the vocational training with some degree of success.

A brief statement of the results can now be given, reserving the more detailed statement involving the basis and methods for the results for future pages. The first aim, to secure an estimate of the reliability of the tests used, was largely successful. Of the thirteen tests, the reliability of which was investigated, one class, the four construction tests, Healy A and B,
and Knox Moron tests, and diamond-shaped frame, were found to be unreliable; five other tests were found to be reliable, namely the Stanford-Binet, Pintner Non-Language group test, Thorndike Reading Scale Alpha 2, Porteus Maze test, and Tapping test; while the reliability of four tests, the Myers Mental Measure, Healy Pictorial Completion test II, Healy-Bronner learning tests, and the Crossline test, was undetermined for various reasons.

The results obtained as to value of the tests were as follows: Stanford-Binet, Pintner, Alpha 2, and Porteus are valuable tests and should be included in individual case studies. In spite of their unmeasured reliability, Myers and Pictorial Completion II are also valuable tests and should likewise be included. Judgment should be suspended with regard to learning tests. The Tapping test is of doubtful value and its use should be left to the discretion of the examiner. The Construction tests, because of their unreliable character, do not give valuable results.

As to the schedule of tests to be used in testing normal minds, it was found best not to use the Stanford-Binet alone but to have the schedule composed of that test and the five others which were found valuable.

From the tests used and results obtained it cannot be stated here whether this schedule is of value as to vocational guidance, for the reason that the factors involved in each test are not known with certainty, and until they are known, definite valid conclusions about the abilities of the individuals concerned cannot be reached.

SUBJECTS

It was desired to test one hundred normal but otherwise unselected children. In order to obtain an unselected group it proved necessary to select the subjects very carefully, for, if all the children tested had been from a Children's Home, or from a Settlement, or from any one school, the result would have been a highly selected group.
To avoid this a few were taken from many different sources, and in this respect the distribution proved to be reasonably satisfactory.

As to age, originally the plan was to have about ten children at each of the ten periods of one year each, from seven to sixteen inclusive. But this plan was given up because our interest is not with the six or seven year old, who has to go to school and learn fundamentals no matter wherein he is gifted, and who rarely shows talents or handicaps at such an early age. Our chief concern is with children in the sixth, seventh or eighth grades and in high school, because they are the adjustment problems, and because it is important to aid them if possible in deciding whether they should remain in school or go to work. If the latter, what should they do; if the former, what sort of training do they need? So the attempt was made to lay all the emphasis here and reduce the number of children under eleven to a minimum. Another objection to the original plan is that ten in a group is too small for any kind of generalization.

The total number tested was 128, of which 116 usable records were retained. For various reasons many of these records are incomplete, so that this number was necessary in order to have a minimum of 100 scores on each test. There still remain some tests which were given to fewer than 100 children, but the number is in each case sufficiently large to give valuable results.

All defectives were excluded, for in mixing their records with those of normal children many confusions would have arisen, and the issues would have been less clear. Much intensive work has been done in testing defectives, so that we know a great deal about their reactions to a group of tests such as we have chosen.
To be sure, they vary considerably in their results, but we know in general the points where they are weakest, as in abstract reasoning and formal generalization, and also the points in which proportionately they excel. By narrowing the field to normals the significance of the conclusions can be made more pertinent. This was an arbitrary procedure dependent largely upon the judgment of the writer, and subject to criticism on this basis. It is quite possible that some very dull normals were also excluded, this being justified on the grounds that their normality might reasonably have been called in question by more severe examiners.

With reference to the three cases whose I. Q.'s fall below 80, there seems to be no doubt that they are to be considered as dull normals. The grade they attained in school for their age, their response on the other tests and their behavior in the community all argue for including them in our study. The boy receiving the lowest I. Q., 73, was born in the United States but was taken to Italy at the age of five, and remained there six years. In spite of this he was in the eighth grade. He did very well with all the construction tests.

As no limitations were set at the other end, the grade and I. Q. distributions are higher than one would otherwise expect in a general sampling of the population.

I. Thirty-seven children, twelve girls and twenty-five boys, were tested at the Home for Jewish Children in Dorchester, Massachusetts. Many of these children were half orphans, some had lost both parents, and most of them were in the Home temporarily. They were chosen from the total number entirely by chance. They all attended public school in the vicinity, and all but two or three had come to the Home within two years. All were able to speak and understand English, this being the only language used at the institution, although in many of their homes no English was spoken. Their ages ranged from 7-0 to 15-1.

II.
Twenty-four girls came from Frances Willard Settlement in Boston, Massachusetts. These were divided into three clubs, one consisting of one seventh grade and ten eighth grade girls, the youngest being 12-7 and the oldest 14-2. They came one evening a week for the express purpose of taking the tests. They were the first ones to volunteer from a large group. The other two groups, of seven and six respectively, were younger girls who happened to meet on afternoons which were convenient for the examiner.

III. Six high school girls in New York volunteered to take the tests.

IV. The ninth grade, consisting of six boys and five girls, in the Woodmere School (private) at Woodmere, Long Island, was tested. The ages ranged from 13-1 to 15-2.

V. The poorer section of the 8B class of Public School 11, New York, was tested. There were thirty boys in the class, ranging in age from 13-2 to 16-11.

VI. Finally, eight miscellaneous children were tested.

The subjects selected appeared to give a satisfactory difference in quality so as to bring out the capacities of the tests to meet a variety of normal mental conditions.

TABLE I
AGE DISTRIBUTION, 116 CASES

Age in years     Frequency
Under 11             19
11                   10
12                   21
13                   19
14                   21
15                   16
16                   10
Total               116

(The month-by-month frequencies of the original table are not legible in this copy; only the yearly totals are given here.)

Distribution of the subjects by age.
It will be noted that only 19 of the 116 subjects are under eleven years old.

TABLE II
GRADE DISTRIBUTION, 114 CASES

Grade               Frequency
I                        1
II                       2
III                      2
IV                      11
V                        7
VI                      16
VII                     10
VIII                    44
IX or I H. S.           12
X or II H. S.            4
XI or III H. S.          5
XII or IV H. S.

Left school: VIII, 1; II H. S., 1.

The vast majority of subjects were in the VIth to IXth grades inclusive. 2 had left school.

[Figure: Intelligence quotient distribution, 112 cases. Scale: 1 square to 1 child; 70 means 70.000 to 79.999, etc.]

The curve of distribution is skewed positively.

TABLE III
DISTRIBUTION OF INTELLIGENCE QUOTIENTS, 112 CASES

I. Q.  Frequency     I. Q.  Frequency     I. Q.  Frequency
 70                   95       2          120
 71                   96       3          121       1
 72                   97       4          122       2
 73       1           98       2          123
 74       1           99       3          124
 75                  100       7          125       1
 76       1          101       7          126       2
 77                  102       4          127       2
 78                  103       2          128       1
 79                  104                  129
 80       1          105       5          130       3
 81                  106       4          131       1
 82                  107       3          132       1
 83                  108       2          133
 84       1          109       2          134       1
 85       2          110       1          135
 86       2          111       1          136       1
 87       1          112       1          137
 88       2          113       3          138
 89       3          114       2          139
 90       1          115       1          140
 91       2          116       4          141       1
 92       1          117       1          142
 93       4          118       2          143
 94       4          119       2          144

4 were not given the Stanford-Binet test.
Average: 104.5
Mental age in months. Average: 154.8
Mean Square Deviation: 34.54

The table shows that very few of the children tested had I. Q.'s below normal.

Age-Grade Distribution

Columns are chronological ages 6.5, 7.5, 8.5, 9.5, 10.5, 11.5, 12.5, 13.5, 14.5, 15.5, 16.5. The column position of each entry is lost in this copy; each row lists its non-empty entries in column order, followed by the row total.

Grade    Entries            Total cases
I        1                       1
II       2                       2
III      1                       1
IV       1  6  3  2             12
V        2  3  1  1              7
VI       6  10                  16
VII      1  5  2  2             10
VIII     5  9  15  11  6        46
IX       7  3  1                11
X        1  2  2                 5
XI       1  4                    5
Total    3  2  8  6  10  21  19  22  15  10        116

Chronological age and mental age distribution.

Columns are chronological ages 6.5 to 16.5 as above; rows are mental ages. The column position of each entry is lost in this copy; each row lists its non-empty entries in column order, followed by the row total.

Mental age    Entries                  Total cases
6.5
7.5           1                             1
8.5           2  1  1  1                    5
9.5           3  1  1  1                    6
10.5          1  2  3  2  2                10
11.5          1  1  3  1                    6
12.5          1  1  3  4  2  3  1          15
13.5          4  1  4  1                   10
14.5          2  3  2  6  6  3             22
15.5          2  4  8                      14
16.5          3  4  2  1  1                11
17.5          2  1  1                       4
18.5          3  1  1                       5
19.5          1  2                          3
Total         3  2  7  6  10  20  17  20  16  10        112

Mental age and grade distribution.
[Table: rows are mental ages 6.5 to 19.5, columns are grades I to XI. The individual entries are not recoverable in this copy.]

TESTS BRIEFLY DESCRIBED AND REASONS FOR THEIR SELECTION

The large number of tests available had to be classified so as to find which tests covered identical ground; only one of these was then selected. Time was an element particularly to be recorded, since preferably less than three hours should be devoted to each child for the completion of all tests. This allotment of time is considered by most authorities to be generous, particularly since the Stanford-Binet takes nearly three quarters of an hour, thus leaving only two hours for all the other tests. Consequently, between alternate tests apparently serving the same purpose the briefer one was chosen. The same limitation on the amount of time to be spent on any one individual caused the necessary omission of some tests which were highly desirable except as to their length. In the last mentioned class are group tests requiring an hour or more to be applied. Where the results that are sought can be reached by group tests, doubtless much time can be saved in using them, but the inquiries involved herein were such as to necessitate largely individual testing.

In selecting the tests another danger that was realized, and that it was attempted to avoid, although, as the results show, not with entire success, was that a great many tests involved many sides of mental activity, so that the final result expressed numerically would not be indicative of which mental abilities had tested favorably and which unfavorably. For instance, ability to deal with abstract and with concrete material may be extreme opposites, giving a correlation of minus 1.00. If both kinds of material are combined in one test, the child who succeeds in one may fail in the other and vice versa.
In computing the final scores compensation will give the same net result to two children of exactly opposite capabilities. If general intelligence is what we want we may find it in this way, but if we are interested in special abilities or disabilities these tests which hide them must not be used. We have found this confusion to exist in many tests, of course never in such an extreme form as in the illustration above, and undoubtedly introduced on purpose, but we feel that its value is at all times questionable. This error is extremely difficult to eliminate completely; in fact we cannot be sure even now, as will appear later in the results, that we have successfully done so.

Another source of error too often overlooked was borne in mind in selection of the tests, namely the variability of the test that is being considered. Where the same test is applied to a person at intervals and it is found that the resulting scores are not identical, the question arises whether the varying scores can be combined so as to give a reliable standard for use and comparison with the results obtained when the test is applied to other children, or whether the variation indicates an unreliability in the test itself sufficiently serious to warrant the test being discarded. As an example of variations of such minor character that their existence does not indicate unreliability, and which can be compensated, we can take the tapping test, where there may be a variation of about five taps in each direction from the average, which would be entirely satisfactory. Such variations are due to unessential and insignificant details of the conditions under which the test is repeated, such as posture of the child being tested, kind of pencil or stylus being used, etc.
Taking ten or more measures of tapping ability would increase the reliability, but the final results would show such slight difference from the result of one or two trials that the frequent repetition is entirely uncalled for to secure reasonable reliability.

On the other hand, if the variations in result obtained by repeated use of a test on the same individual are not of a minor character, and if the day-to-day variability is so erratic that the variation is all the way from good performance to poor performance, then the situation is either that the child tested is shown to be subject to mental disturbance, or that the test itself shows a high and dangerous variability. If it is the test that is variable, it is obviously essential to weed it out ab initio. Such variability has been found to exist in the Knox cube test, in the application of which a uniformly normal child may make the record of an imbecile one day and of a super-normal child the next day. Of course, such a test, if not eliminated, would lead to results that are valueless for comparative purposes and dangerous for diagnostic ones.

As to variability, the reliability of a number of tests was established and recorded before the study was undertaken. As to the remaining tests, in order to overcome the possible existence of variations indicating unreliability it was necessary to retest each child with the same or with a similar test after an interval of a week: no less, or practice effect would be met; no more, to avoid the effect of any mental growth in the interval.

The necessity of retesting, caused by possible variability in the test itself, led to the subordinate but difficult problem of determining what methods of retesting would avoid errors due to the process itself. Thus, as has been mentioned, retesting must be done in such a manner as to avoid practice effect.
It has been shown by various workers that certain types of tests once solved, such as most puzzles, are no longer tests at all, whereas others, such as auditory memory for digits and psychomotor control, show a minimum effect, which, after the week between tests, is negligible. Those of our tests which come within the last-mentioned class were simply repeated. Those which were of the former type had similar tests substituted for them in the second trial, while still others falling between these classes were altered in details so that the same test could be repeated, avoiding the memory aspect.

The tests finally selected were:

1. The Stanford revision of the Binet-Simon scale. This test is so widely known that it does not seem to be necessary to describe it here.

2. Pintner's mental survey non-language group test, with Myers Mental Measure as an alternate for repeating. These tests involved a minimum use of language. In the Pintner test no language is used in the performance, and in fact it is possible to give this test to foreigners or deaf children through the medium of signs, while in giving the Myers Mental Measure it is necessary for the subject to understand simple language, but none is used in executing the test. The Pintner test has six parts, the first resembling the Knox cube test, the second and third being substitution tests, the fourth a drawing completion, while the fifth is a reversed drawing test, and the sixth a picture reconstruction. Following directions, Pictorial Completion, and two tests of picking out objects with common elements compose the Myers test.

3. Thorndike's reading scale Alpha 2. This is a test in which language plays a prominent part. The subject reads a paragraph and then reads certain questions based upon the paragraph, to which he writes his answer.
To succeed he must understand the context of the paragraph, he must understand the question and know what it calls for, and he must be able to find the answer in the context and write it down. This is a graded test which is applicable from the second grade through high school. Since the practical work of this research was undertaken, Dr. McCall of Teachers College has considerably increased the usefulness of this test by devising ten sets identical in method but with different contents, of which the test here used is one. It is now known as the Thorndike-McCall reading scale and its reliability has been thoroughly established.

4. Healy's Pictorial Completion Test II is an apperception test with the language element omitted. The ten pictures (plus one sample) present a day's activities of a young school boy, in which each picture contains a situation known to every child, such as eating breakfast, the school cloak room, a street accident, etc. In each picture one important element is lacking; pieces which complete the picture, plus fifty more of the same size, are arranged in a definite order in a box from which the subject is at liberty to choose those which he desires. A clue to the missing piece is furnished by the pictures.

5. Porteus Maze Tests, Vineland Revision 1919. These tests are supposed to measure social fitness and common sense. Among the capacities which they were devised to measure are forethought and planning capacity, prudence and mental alertness in meeting a situation new to experience. There are eleven mazes, graded in difficulty from year three to fourteen. Beginning with year five, avoidance of blind alleys is the main requirement for a successful performance. The more complex the maze, the further ahead must one look in order to be certain that one is choosing the correct path.
There is no time limit; in fact no mention of speed is made, and if the child asks he is told to do it as well as possible, taking as long as he likes. Porteus says that children fail mainly because of impulsiveness in action, overconfidence and carelessness, lack of pre-consideration, lack of planning capacity, irresolution and mental confusion, inability to sustain attention, or to profit by past mistakes.

6. Tapping Tests, Healy's Form. This consists of a sheet containing one hundred and fifty half-inch squares, arranged ten in a row in fifteen rows. The subject taps once in each square, without touching the lines, and covers as much ground as he can in thirty seconds. This is a simple test of psychomotor control which was repeated without alteration.

This test in a slightly different form was first introduced by Cattell in 1896, for testing freshmen at college. He had one hundred 1 cm. squares, into each of which the student must put a dot, completing the task as quickly as possible. Time was recorded; evidently there were no errors. This test was supposed to measure rate of movement. Clark Wissler used it with many of Cattell's other tests in his "Correlation of Mental and Physical Tests" on college freshmen in 1901. He found that the average time for men was 34 seconds, for women 30.8 seconds. In 1911 Whitley (M. T. Whitley, An Empirical Study of Certain Tests for Individual Differences) reports results on Cattell's test, in which she kept the time constant (30 seconds) but computed the length of time which it would take to complete the blank. We have found the additional fifty squares useful in that some of our cases marked over one hundred squares in the thirty second time limit.

7. Healy's Construction Tests A and B. The Knox-Moron test and Knox Modification of Healy A, a diamond-shaped frame, were used as alternates.
We have called these A and B respectively to correspond with the Healy tests and for convenience. The equipment for these tests consists of a board containing one or more openings into which the child tested is supposed to fit pieces of wood so shaped that when properly arranged they will just close up the apertures. An advantage of these tests is the convenient size of the materials required. As all materials had to be carried from place to place, the use of clumsy form boards or the tapping board with its dry batteries, metal plate and stylus was practically out of the question. Where other things were equal, tests having the least paraphernalia were to be preferred.

8. The Crossline Tests shown in the figure were also given. The crossline tests were included because they are a modification of the famous Code test, which is generally considered one of the best in the whole Stanford-Binet series. They take very little time to give and can easily be modified for repetition.

[Figure: I. Crossline Test; II. Crossline Test; forms (a) to (d).] (a) and (c) are the forms used generally. (b) and (d) were used for retesting.

9. Healy and Bronner Learning Tests. These tests were devised to test learning ability, not as in the skill experiment, but as it is found essential in the elementary school subjects. Learning test A, the association of two symbols, a figure and a number, resembles other substitution tests such as those of Woodworth and Wells, Pintner, and especially Woolley. The difference lies in the fact that three trials were given and speed of learning determined success. Learning test B is the association of a symbol with a sound, as in learning a language. The symbols are from the Phoenician alphabet, and the sounds consist of one or two consonants and a vowel, simple enough to pronounce but without meaning.
This prevents older children from forming associations which would be impossible for those who did not know the meaning of the syllables. Test C is the association of a symbol and a value presented audibly, and test D is the association of ideas with a picture. The first three test a sort of rote ability whereas the last tests learning of ideas.

It seems reasonable that success in school work may depend as largely upon learning ability as upon mental capacity, especially in the early grades where the chief requirement in most of our schools is a good rote memory, as in learning multiplication tables, and these two do not necessarily go together. Certain clinical cases bear out this suggestion, and these tests were included to ascertain the reactions of normal unselected children in this respect.

National Intelligence tests were not yet published in November and December, 1919, when this study was begun, or they would surely have been considered and very likely used.

METHOD: APPLYING TESTS TO SUBJECTS

General Considerations

All of the tests except the non-language group tests and the Thorndike Alpha 2 were given to the subjects individually. The non-language tests were given sometimes individually and sometimes in groups of about ten, with one exception where thirty eighth grade boys were tested in a group.

The time of day at which the tests were given varied considerably. About fifty of the subjects, from the Children's Home and from the Settlement, were tested in the evening. All others were tested in the daytime. Care was taken to avoid giving any tests while the subject might be fatigued. Each child was questioned regarding the matter, and whenever there were indications of fatigue the testing was always postponed. Usually a subject was tested for only an hour and a half at one time; frequently the duration of the testing was shorter and only occasionally was it longer.
The tests were all scored according to the directions laid down by their respective authors. They were all scored personally by the examiner twice. In all of the tests selected for use the scoring is objective and requires no technique. Where possible, score cards or keys were used. Where the time taken by a subject to complete a test was to be recorded, the timing was done by means of a stop watch.

Much effort was expended in persuading the subjects to give an equal amount of attention and concentration to all of the tests, so that the results would not be affected by individual preferences. For a large proportion of the subjects the incentive of vocational guidance was offered, and some general vocational advice, based partly on the experience of the examiner as well as on the tests, was given at the conclusion of the testing. Younger children needed no incentive, and their enthusiasm was so pronounced that they continually applied to take more than the regular number of tests.

Supplementary information concerning the subjects was gathered and recorded, especially age in months, school grade, success in school work, marks, standing in class, whether a repeater and how often, whether the subject skipped any grades, etc. The vocational plans and interests of the older children were obtained whenever they had any. Results of physical examination were obtainable for a large per cent of the cases. Several subjects had also been given neurological examinations. Occasionally some result can be explained by reference to these findings, as for instance an unaccountably poor performance on the Healy Pictorial Completion test which was probably due to uncorrected vision. One case where peculiar results were obtained from the tests was explained by the physical examination, which showed a history of epilepsy, and thereupon the case was no longer considered.
Specific Observations on the Application of the Tests Selected

Stanford-Binet. In the United States there have been several revisions of the Binet-Simon test, the most recent and also the best of these being that by Professor Lewis M. Terman of Leland Stanford University, California, published in its final form in 1916. This revision, called the Stanford-Binet, was the one used in this study. The score obtained in the Stanford-Binet test is expressed in years and months, mental age. This mental age, when divided by the life age, results in the intelligence quotient, which is expressed as a decimal. There have been some wrongful uses of the intelligence quotient. It is an attractive but erroneous idea that a certain intelligence quotient can be found below which all can be considered feeble minded while all above are normal or supernormal. The error in this idea has been pointed out by Fernald, Mateer, Kohs, and others, who demonstrate the degree of overlapping, and show how valueless the I. Q. is when reported without reference to life age.

The Stanford-Binet results can be analyzed, as well as summed up in the I. Q., and it is possible that a detailed analysis of the data would yield all the information required. The plea that general intelligence scales have a right to be so called is largely based upon the supposition that the functions which are tested are manifold. Auditory memory for rote material and for ideas, visual memory, language ability, reasoning ability, apperceptions, general information and many other abilities: all are found within the total range of tests. Unfortunately, in the Stanford scale no child gets tested in all these fields, and further, since they are not standardized separately the significance of success or failure in one part is difficult to determine. The Stanford-Binet tests were all given by the writer in the manner described by Terman.
It is unnecessary to repeat this test in order to establish its reliability as the reliability has been independently reported upon by Terman.

The vocabulary and memory span for digits of the Stanford-Binet were given with the Porteus tests, the remainder of the Stanford-Binet taking only one session.

The Alpha 2 Reading Scale was scored by the method worked out by Kelley and his tables were used.

The tapping test was scored for number of taps and errors.

In the construction tests number of moves and time were taken and when the test was not completed within the limit of five minutes it was scored as a failure and the number of moves up to that time was noted. A construction test, once solved, is much easier to solve a second time unless the first solution was due to chance. Healy A was repeated in order to check the first performance. In Healy's construction test B a second trial generally brings a result as near perfect as possible (that is, dependent only on skill and speed in motor performances), even if the first solution was hit upon by chance. It is impossible to do away with the chance element in performance tests, but in order to guard against it as much as possible, two tests were used each time, and the selection was made after a study of many types. There are several difficulties in making this choice and we were impressed by the fact that most performance tests have not been standardized and that there are very few tests of this kind which are sufficiently difficult for older subjects. The Healy and Knox tests satisfied both of these conditions.

The scoring for the learning tests is rather complicated. A perfect score on all four tests is four hundred, one hundred being the perfect score for each test. Learning test A has twelve elements, and if these were all correct on the three trials, thirty-six elements would receive a mark of one hundred, or each would get 2.8. Thus the score equals the number correct multiplied by 2.8.
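The scoring rule for learning test A can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, and rounding the per-element value to one decimal is assumed because the text quotes 2.8 rather than 100/36.

```python
def element_value(n_elements: int, n_trials: int = 3) -> float:
    """Value of one correct element, chosen so a perfect test scores about 100.

    Rounding to one decimal is an assumption taken from the quoted 2.8.
    """
    return round(100 / (n_elements * n_trials), 1)


def learning_score(n_correct: int, n_elements: int, n_trials: int = 3) -> float:
    """Score = number of correct elements times the per-element value."""
    return round(n_correct * element_value(n_elements, n_trials), 1)


# Learning test A: twelve elements on each of three trials.
value_a = element_value(12)              # 2.8
perfect_a = learning_score(36, 12)       # 36 correct elements
half_right_a = learning_score(18, 12)    # 18 correct elements
print(value_a, perfect_a, half_right_a)
```

Because the per-element value is rounded, a perfect performance comes out at 100.8 rather than exactly 100, which is a consequence of the rounding and not a separate rule.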
When a perfect score is made on the first or second trial, it is assumed that further trials would give a perfect score also. In learning test B there are five symbols in each of the three trials, and consequently each receives a value of 6.7. In test C there are seven symbols and three trials. Dividing one hundred by three times seven there results a value of 4.7 for each, while in test D, which has ten items, the total number is thirty, with a value of 3.3 each. The total for all the tests is the sum of the score on each of the four.

RESULTS

Where a clinician is generally satisfied to take the score obtained by applying a test as a final goal, if in fact he goes so far as to work out a score, it is obvious that to attain the purposes here in mind the scores of the various tests used must be compared to gather statistics reflecting their qualities. That is, when the one hundred and sixteen children had been given the tests that were selected and when the scores were recorded, the field work was completed, but there remained to investigate in a laboratory manner what a combination of the results would show with reference to the purposes of this study.

This comparison of results was made by correlation, that is, by measuring the mutual implications (see Thorndike, Mental and Social Measurements, pp. 156-185). A test is to be evaluated in three ways: its correlation with criteria other than results of tests; its self-correlation; and its correlation with other tests.

In the present inquiry we obtained no outside criteria with which to correlate our tests, because no outside criteria available could be relied upon. In the field of mental abilities, the only criteria which have been widely used are teachers' opinions, school marks, etc. These are unsatisfactory at best.
Although we possess all these data for our cases we consider them useless since the children attended eight different schools in four places, with the marking systems varying for each. We compared judgments as to intelligence made by the teacher of the ninth grade of the Woodmere school with those made by the eighth grade teacher of the New York Public School. In the former the I. Q.'s varied from 95 to 141; in the latter from 73 to 116. In the former all but two children tested as supernormal and the class average was 121, whereas in the latter only one tested above 110 with a class average of 96. But to read the teachers' judgments one would think that the pupils of the latter school were considerably more intelligent than those of the former. Even the comparative ratings within one group were markedly unreliable. They showed all the errors of judgment pointed out by Terman. No account was taken of age; the best behaved, most conscientious pupil was invariably considered the most intelligent, etc. What is the use of making correlations with this kind of material, when one knows in advance that all the fault of a low correlation will be attributed to the criterion, and the tests will stand as before, unknown quantities! Moreover, these criteria could only be used to represent a measure of general intelligence. The teachers admittedly knew practically nothing about the special abilities of their pupils; the parents, where consulted, knew very little more. A rating on general intelligence has been frequently correlated with general intelligence tests, and the results published. Our data would present no new factors.

Consequently we have not evaluated the tests by means of correlation with outside criteria but we do have the data for self-correlations and for inter-correlations. Where various tests which we used intercorrelate extremely highly, we may feel that they are measuring the same thing.
On the other hand, if the intercorrelations approach zero or are negative, the results indicate that we have no evidence that aspects of intelligence are being measured at all. Only if the correlations are sufficiently high to indicate that intelligence is being measured and low enough to show that different factors are entering into the different tests, can we consider the tests worthy of being included in mental examination. In judging our correlations we must remember that we are testing normal children only, and therefore our coefficients are lowered, and that our ages do not cover a large area, which also lowers the coefficients of correlation.

Our conclusions are limited to the tests we used but the general method of dealing with the scores has a wide applicability.

TABLE IV
DISTRIBUTION OF PINTNER SCORES - 100 CASES

Score        Frequency     Score        Frequency
200-209.9        -         370-379.9        3
210-219.9        -         380-389.9        5
220-229.9        1         390-399.9        2
230-239.9        -         400-409.9        4
240-249.9        -         410-419.9        5
250-259.9        1         420-429.9        5
260-269.9        -         430-439.9        7
270-279.9        1         440-449.9        4
280-289.9        1         450-459.9        5
290-299.9        2         460-469.9        7
300-309.9        1         470-479.9        5
310-319.9        1         480-489.9        1
320-329.9        2         490-499.9        4
330-339.9        4         500-509.9        6
340-349.9        2         510-519.9        4
350-359.9        6         520-529.9        1
360-369.9        8         530-539.9        2

16 were not given the Pintner Test. The evenness of distribution of scores is noticeable.
Average = 420.964. Unreliability 6.9.
Mean Square Deviation = 68.96. Unreliability 4.9.

TABLE V
DISTRIBUTION OF MYERS SCORES - 90 CASES

Score   Frequency     Score   Frequency     Score   Frequency
 16         -          46         -           76        1
 17         1          47         1           77        1
 18         -          48         3           78        -
 19         -          49         1           79        -
 20         1          50         2           80        -
 21         -          51         1           81        2
 22         1          52         4           82        1
 23         1          53         4           83        -
 24         -          54         2           84        1
 25         1          55         2           85        -
 26         1          56         -           86        -
 27         1          57         3           87        2
 28         2          58         2           88        -
 29         2          59         2           89        -
 30         -          60         2           90        -
 31         1          61         1           91        1
 32         -          62         3           92        -
 33         2          63         3           93        -
 34         2          64         3           94        1
 35         -          65         2           95        -
 36         3          66         -           96        -
 37         1          67         -           97        -
 38         1          68         2           98        -
 39         -          69         3           99        -
 40         1          70         1          100        -
 41         1          71         1          101        -
 42         -          72         1          102        -
 43         4          73         1          103        1
 44         -          74         -          104        -
 45         2          75         -          105        -

26 were not given test.
Average = 53.325. Unreliability 1.8.
Mean Square Deviation = 17.88. Unreliability 1.3.

TABLE VI
DISTRIBUTION OF ALPHA SCORES - 107 CASES

Score   Frequency     Score   Frequency     Score   Frequency
3.6         2          5.4        1           7.3       7
3.7         -          5.5        1           7.4       6
3.8         -          5.6        1           7.5      12
3.9         -          5.7        1           7.6       2
4.0         -          5.8        -           7.7       6
4.1         3          5.9        2           7.8       1
4.2         1          6.0        -           7.9       2
4.3         -          6.15       1           8.0       2
4.4         -          6.2        3           8.1       1
4.66        1          6.3        -           8.2       2
4.6         -          6.4        2           8.3       2
4.7         3          6.5        1           8.4       1
4.8         1          6.6        3           8.5       2
4.9         2          6.7        6           8.6       -
5.0         1          6.8        4           8.7       -
5.1         3          6.9        3           8.8       1
5.2         6          7.0        1           8.9       -
5.3         -          7.1        -           9.0       1
                       7.2        7

9 were not given the Alpha Test.
Average = 6.834. Unreliability .116.
Mean Square Deviation = 1.20. Unreliability .082.

TABLE VII
DISTRIBUTION OF PICTORIAL COMPLETION TEST SCORES - 110 CASES

Score          Frequency     Score          Frequency     Score          Frequency
-5 to 0            2         30 to 34.99        6         65 to 69.99       12
0 to +5            2         35 to 39.99        6         70 to 74.99        7
5 to 9.99          2         40 to 44.99        6         75 to 79.99        4
10 to 14.99        1         45 to 49.99       10         80 to 84.99        9
15 to 19.99        4         50 to 54.99        8         85 to 89.99        4
20 to 24.99        4         55 to 59.99       12         90 to 94.99        2
25 to 29.99        -         60 to 64.99        8         95 to 99.99        1

6 were not given test.
Average = 54.527. Unreliability 2.16.
Mean Square Deviation = 22.69. Unreliability 1.5.
TABLE VIII
LEARNING TESTS DISTRIBUTION - 106 CASES

Score        Score        Score
170-179.9    250-259.9    330-339.9
180-189.9    260-269.9    340-349.9
190-199.9    270-279.9    350-359.9
200-209.9    280-289.9    360-369.9
210-219.9    290-299.9    370-379.9
220-229.9    300-309.9    380-389.9
230-239.9    310-319.9    390-399.9
240-249.9    320-329.9    400-409.9

Frequencies as they survive in the source, in reading order and without their row positions: first column 2, 1, 2, 1, 4, 1; second column 4, 7, 6, 5, 8, 10; third column not legible.

10 were not given the tests; 3 none at all; 7 not all four.
Average = 300.66. Unreliability = 5.08.
Mean Square Deviation = 52.31. Unreliability = 3.6.

TABLE IX
PORTEUS SCORES DISTRIBUTION - 113 CASES

Score   Frequency     Score   Frequency     Score   Frequency
5           2          8.5        4          11.5      14
5.5         1          9          2          12         7
6           -          9.5        6          12.5      15
6.5         1         10          6          13        15
7           3         10.5        8          13.5       4
7.5         4         11         11          14         7
8           3

3 were not given this test.
Average = 11.09. Unreliability .19.
Mean Square Deviation = 2.02. Unreliability .13.

TABLE X
DISTRIBUTION OF CROSSLINE TEST SCORES - 114 CASES

Score (I-II)    Frequency     Score (I-II)    Frequency     Score (I-II)    Frequency
Both OK^1          70         OK^?-OK^?           1         OK^?-F              3
OK^?-OK^?          13         OK^?-OK^?           2         OK^?-F              1
OK^?-OK^?           5         OK^?-OK^?           1         OK^?-F              2
OK^?-OK^?           5         OK^?-OK^?           2         OK^?-F              1
OK^?-OK^?           2         OK^?-OK^?           1         F-F                 5

OK^1 = Correct on first trial. OK^2 = Correct on second trial. F = Failure on fourth trial. (^? marks a trial number that is illegible in the source.)

TABLE XI
DISTRIBUTION OF TAPPING SCORES. AVERAGE OF 2 TRIALS - 113 CASES

Score          Frequency     Score          Frequency     Score            Frequency
40 to 44.99        1         65 to 69.99       11         90 to 94.99          5
45 to 49.99        3         70 to 74.99       13         95 to 99.99          5
50 to 54.99        5         75 to 79.99       23         100 to 104.99        2
55 to 59.99        9         80 to 84.99       20         105 to 109.99        -
60 to 64.99        9         85 to 89.99        6         110 to 114.99        1

Average = 73.43. Unreliability 1.26.
Mean Square Deviation = 13.39. Unreliability .89.
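The "unreliability" figures printed beneath each table can be checked with a short sketch. It assumes the two formulas are the mean square deviation divided by the square root of n for an average, and divided by the square root of 2n for the deviation itself, which is how they appear to be stated in Thorndike's notation; the function names are illustrative only.

```python
import math


def unreliability_of_average(deviation: float, n_cases: int) -> float:
    """Assumed formula: mean square deviation / sqrt(n)."""
    return deviation / math.sqrt(n_cases)


def unreliability_of_deviation(deviation: float, n_cases: int) -> float:
    """Assumed formula: mean square deviation / sqrt(2n)."""
    return deviation / math.sqrt(2 * n_cases)


# Tapping test (Table XI): mean square deviation 13.39 on 113 cases.
tapping_avg = unreliability_of_average(13.39, 113)
tapping_dev = unreliability_of_deviation(13.39, 113)
print(round(tapping_avg, 2), round(tapping_dev, 2))  # 1.26 0.89
```

Under these assumptions the computed values agree with the 1.26 and .89 printed for the tapping test, and the same check can be repeated for any other table from its deviation and number of cases.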
TABLE XII
DISTRIBUTION OF CONSTRUCTION AND KNOX - TIME - 108 CASES

Score            Frequency     Score            Frequency     Score            Frequency
50 to 99.99          1         350 to 399.99       12         650 to 699.99        6
100 to 149.99        7         400 to 449.99        8         700 to 749.99        4
150 to 199.99        5         450 to 499.99        8         750 to 799.99        2
200 to 249.99       10         500 to 549.99        9         800 to 849.99        3
250 to 299.99       10         550 to 599.99        5         850 to 899.99        2
300 to 349.99        8         600 to 649.99        7         900 to 949.99        -
                                                              950 to 999.99        1

Average = 420 to 480, or 7.685.
Mean Square Deviation = 3.39.

Tables 4 to 12 inclusive show the distribution of scores on the various tests. The average, or more properly speaking the arithmetic mean, and the mean square deviation are also given for each. That we have sufficient cases is shown by the relation of the variability to the average. In only a few instances is it large enough to raise a doubt as to whether enough cases were used. These are the Pictorial Completion test, the Construction tests, and the Myers Mental Measure. The formula for the unreliability of an average is $\sigma_{\text{T-obt.av.}} = \dfrac{\sigma_{\text{dis.}}}{\sqrt{n}}$; for the unreliability of a mean square deviation it is $\sigma_{\text{T-obt.}\sigma} = \dfrac{\sigma_{\text{dis.}}}{\sqrt{2n}}$. These data are also included in the tables. (See Thorndike, Mental and Social Measurements.)

A few special considerations arose at once with reference to the crossline tests, the tapping test and the construction tests.

The crossline test has no value for our subjects (see Table X); one hundred and fourteen cases were tested, out of which 70, or over 60 per cent, made perfect scores, the remaining 40 per cent ranging almost indifferently from one error to complete failure. This test is, then, far too easy for our subjects, and the results are useless for our purposes. We will disregard it completely from now on.

In dealing with the tapping test we were confronted with the problem of how to handle the errors.
Since a perfect correlation would be expected between two absolutely perfect tests of tapping ability, the highest correlation obtainable is presumably the one which best accounts for the errors. On this assumption the two trials of fifty cases of the tapping test were correlated both by Pearson and Spearman formulae, first disregarding the errors, then weighting them one each, and finally weighting them two points each, with the following results:

                            Pearson      Spearman
Errors disregarded          r = .794     r = .917
Errors weighted one each    r = .773     r = .90
Errors weighted two each    r = .764     r = .82

It would seem then that the errors are of comparatively little importance, but as disregarding them gives the highest self-correlation, they will be omitted in any correlations in which the tapping test is involved.

A similar problem is presented by the construction tests, where we have scores for time and moves: Should they be combined and if so, how? If not, are they both important, or only one, and if the latter, which one? In order to arrive at an unbiased conclusion, for it was the writer's opinion that time was by far the most valuable measure, the advice of fifteen other persons was sought. These others were all familiar with the tests, and had used them extensively in clinical work. By far the majority were in favor of using both time and moves, each independently of the other. Two of these considered the moves decidedly more important than time; two others stated that time alone was sufficient, because time and moves had been found to correlate so highly, that the difference between using them and not doing so was within the probable error of either one. None recommended attempting to combine them. The following correlations were therefore made:

Construction A with B, time.
Construction A with B, moves.
Knox A with B, time.
Knox A with B, moves.
Average Construction A and B with average Knox A and B, time.
Average Construction A and B with average Knox A and B, moves.

If the test was not completed in five minutes it was scored as a failure and the number of moves up to that time recorded. Some children who solve the test in three minutes make more moves than others who fail in five minutes. How can one tell how many moves the latter would have made, had they completed the test? Obviously, the number they made until they were arbitrarily stopped is not a fair measure. It was finally decided to omit all cases where any construction test was a failure, from the moves correlations.

The crude scores were not used in the time correlations, but the three hundred seconds were divided into twenty groups of fifteen seconds each. Anyone succeeding with a test in fifteen seconds or less was put in group one; if he took more than fifteen seconds and less than thirty-one seconds he was put in group two, and so forth. All who failed the test were put in group twenty, thus making it possible to include in these correlations many cases which had to be excluded from the correlations of number of moves made.

Taking up first the self-correlations, that is, the correlations of our alternate tests with each other, or the correlations of the scores obtained by repeated use of the same test, the results were as follows:

As only a few Stanford-Binets were repeated, the results are of little significance. We obtain a correlation of .89 on our fourteen cases. L. M. Terman ("The Intelligence of School Children" ch. IX.) had retests given to three hundred and fifteen children, out of which forty-six were given three or more tests. The interval between the first and second testing ranged from one day to seven years. The central tendency of change is represented by an increase of 1.7 in I. Q.; the middle fifty per cent of change lies between the limits of 3.3 decrease and 5.7 increase.
Consequently the probable error of a prediction based on the first test is 4.5 points in terms of I. Q. The correlation between all the testings is .933. Apparently whether the interval be a few months or several years does not influence the result. If the re-examination be within a few days, the I. Q. will, on the average, be raised only two or three points, and this when no restriction has been put on the children communicating with one another.

There are several exceptions to this general rule, one being that young feeble-minded children tend to show their feeble-mindedness more as they grow older; that is, they test lower on the Stanford-Binet. We need not concern ourselves with this, as only normal children were included in this study. Another obvious factor which tends to make the I. Q. appear unstable is due to the fact that the test is limited at the upper end. As a child with a high I. Q. grows older, the I. Q. drops until at the age of sixteen years the highest I. Q. obtainable is 122.

In many pathological cases such as children suffering from epilepsy, chorea, etc., the I. Q. fluctuates considerably. But even within the ranges of normality, Terman thinks that fluctuations occur for at least three reasons.

1. There may be a certain amount of irregularity in the actual rate of mental development.
2. The results of a test may be influenced to some extent by the conditions under which it is given, the state of the child's health, his attitude toward the test, fatigue and other temporary and accidental features. Retests after a brief interval indicate that errors from this source are ordinarily not large.
3. There is inevitably a certain amount of error in every I. Q. rating due to imperfections in the scale used.
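The probable error of 4.5 I. Q. points quoted from Terman above appears to follow from the limits of the middle fifty per cent of changes, a decrease of 3.3 and an increase of 5.7. As a worked step, and assuming the probable error is taken as half the distance between those two limits (the symbols $Q_1$ and $Q_3$ are introduced here only for illustration):

```latex
\text{P.E.} \approx \frac{Q_3 - Q_1}{2} = \frac{5.7 - (-3.3)}{2} = \frac{9.0}{2} = 4.5 \ \text{I.Q. points}
```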
What has been generally criticized in the Stanford-Binet scale, namely that it measures different things at different years and consequently that a subject might do very well when his memory ability for example was tested, and very poorly when his reasoning ability came into the foreground a couple of years later, does not seem to be valid on actual findings. The theoretical argument against such a criticism is that so many age levels are tested each time that a subject will win and lose points in every branch which the test includes.

The Pintner and Myers tests were chosen to measure the same thing, and so we expected to find a high correlation between them. The Pearson coefficient of .584 was so unexpected that we felt that further investigation was needed. A closer study of the tests revealed the fact that their likeness rested on negative similarity; neither involved the use of language, but in other respects they apparently required different abilities. The Pintner test appeared more limited, more mathematical, involving concrete situations rather than generalizations, while the Myers on the other hand was more general, but rather sketchy. In order to test the truth of this hypothesis, the six Pintner tests were intercorrelated and also the four Myers tests (see table). The average of the Pintner intercorrelations was .234, of the four Myers tests correlated each with all the others, .445. It will therefore be seen that the above explanation is unsatisfactory.

TABLE XIII
PINTNER TESTS INTERCORRELATED

               1       2       3       4       5       6     Composite
1              -     -.009    .392    .325    .183    .337     .618
2            -.009     -      .470    .022   -.022    .107     .396
3             .392    .470     -      .361    .224    .316     .757
4             .325    .022    .361     -      .035    .456     .625
5             .183   -.022    .224    .035     -      .307     .540
6             .337    .107    .316    .456    .307     -       .670
Average       .249    .126    .353    .240    .154    .305
Composite     .618    .396    .757    .625    .540    .670

Average of all above correlations, regarding signs: + .234.
Average of all above correlations, without regarding signs: + .238.
Probable Error of each correlation, approximately .05. Number of cases 100.
The fact that correlations between the separate Tests are low, while those of each Test with the composite of all 6 are high, indicates merit in the Test as a whole.

TABLE XIV
MYERS MENTAL MEASURE INTERCORRELATIONS

               1       2       3       4     Composite
1              -      .470    .469    .564     .786
2             .470     -      .346    .424     .796
3             .469    .346     -      .403     .477
4             .564    .424    .403     -       .775
Composite     .786    .796    .477    .775

Average of all above correlations: + .445. Number of cases: 89.
The comment made concerning the previous Table (Pintner Tests) applies to some extent to the Myers Test also. However, the correlations between the separate tests are much higher than those found between the Pintner Tests.

For if the Pintner tests were all of the same nature, including the same factors, their intercorrelations would be high. On the other hand, if the Myers tests were general, their intercorrelations would be lowered. Just the opposite occurs; the Pintner intercorrelations are lower than the Myers. These correlations can probably be explained on another basis. In the Pintner series certain tests are easier than others, most especially the second and fourth, which lowers the intercorrelations. In the Myers Mental Measure all the tests with the exception of the third are of about the same difficulty, the grading being within the test, and this raises the correlations.

It also seems probable that while the Pintner tests do measure more limited factors, each test may measure a different one, the type of material alone remaining the same. On a priori grounds something of this sort seems likely, for the material is practically the same, the correlations are low, so the factors measured must be different.
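The two averages reported under Table XIII can be reproduced from the fifteen distinct Pintner intercorrelations. The sketch below is only a check on that arithmetic; the dictionary layout and variable names are illustrative choices, and the coefficients are copied from the table.

```python
# Off-diagonal entries of Table XIII, one per pair of Pintner tests.
pintner_r = {
    (1, 2): -0.009, (1, 3): 0.392, (1, 4): 0.325, (1, 5): 0.183, (1, 6): 0.337,
    (2, 3): 0.470, (2, 4): 0.022, (2, 5): -0.022, (2, 6): 0.107,
    (3, 4): 0.361, (3, 5): 0.224, (3, 6): 0.316,
    (4, 5): 0.035, (4, 6): 0.456,
    (5, 6): 0.307,
}

values = list(pintner_r.values())

# Average "regarding signs": negative coefficients count as negative.
signed_mean = sum(values) / len(values)

# Average "without regarding signs": absolute values are averaged.
unsigned_mean = sum(abs(v) for v in values) / len(values)

print(round(signed_mean, 3), round(unsigned_mean, 3))  # 0.234 0.238
```

The per-test figures in the "Average" row of the table (.249, .126, and so on) match the unsigned version of the same calculation applied to one test at a time.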
It is true that in the Myers Mental Measure the ability to respond to the spoken word (directions) is part of the test, and it is possible that this is a special ability, calling forth something akin to the abilities necessary for success with the Stanford-Binet, even where the language itself is easily understood. Such a factor our data are unable to measure, but it is interesting in this connection to compare the correlation of the Stanford-Binet with Pintner and of the Stanford-Binet with Myers.

In devising the Pintner non-language test, the effort was made to have it extend from the lowest to the highest grades. This meant introducing tests such as the second, which is far too easy for a child after he has reached the fourth or fifth grade, and also others which were almost incomprehensible to the young child, as tests four and six. Since our subjects are for the most part past the fourth or fifth school grade, we would expect to find some sign of their maturity in the correlations. Reference to the table shows that test two correlates lower with all the other tests than any other single test. The one exception is the correlation of tests two and three, which, it will be remembered, are identical in form, the latter being different from the former only in degree of difficulty. Test six, on the other hand, correlates higher with every other test than test two. This is as it should be: had we tested younger children the table would probably have shown entirely different results. Incidentally, these findings show the importance of bearing in mind the nature of the group that is being studied when interpreting correlations.

Each part was also correlated with the total test score, with high results throughout, with the exception of test 2. In looking at this table, one must feel that the test is a good one, for the intercorrelations of the separate tests are low, but with the composite they are high.
Before leaving the Pintner test, mention should be made of a study by Jeanette Chase Reamer, in which she retested over four hundred children with this test with slightly less than a two-year interval, and found a correlation of .726 between the relative positions which they occupied at each testing. The closeness of this correlation was a complete surprise both to her and to Professor Pintner.

With regard to the Myers Mental Measure intercorrelations, we find them all fairly high and regular. The most surprising thing about them is that tests three and four, which appear far more similar than any other two tests in this series, should have one of the lowest correlations, lower than four with one or four with two. Also, we see no reason why one and four should correlate higher than any other two. If the language factor were significant, we should find one and three (where audible directions must be followed for each separate unit of the test) correlating highly, and also two and four (where after the original directions the subject is left to himself). But as a matter of fact, one and two, and one and four are higher than one and three, and two and four. However the degree of difference between the various correlations is so small that these comparisons must be taken in a negative rather than a positive sense; that is, we might have expected the correlations to prove something, instead of which they prove nothing! With the composite the separate tests correlate very highly, as would be expected since the composite includes always the test being correlated with it, thus giving a perfect relation between two out of the five factors. Test three proves an exception here also, and we feel that the fault is the same as with Pintner two: it is too easy for our subjects.

The Porteus Maze Test when correlated with itself gives a correlation of .95, which is high and satisfactory.
The intercorrelations of the construction tests gave the most disconcerting results of all. They seem to prove Professor Thorndike's assertion that no matter how many construction tests are used, one cannot do away with the chance element. If four construction tests, when correlated for time and moves, give only .16 for the former and .08 for the latter, it seems like a hopeless task to give sufficient tests to raise the correlation to the high 70's or 80's. This is indeed a problem, for the construction test as such is undoubtedly desirable. Perhaps more important than these low numerical results is the fact that combining the individual tests does not seem to operate to raise the correlations. Thus the two Healy tests when correlated for time give a result of .21, the two Knox tests similarly correlated give .27, but the average of Healy tests with average of Knox tests shows a correlation of only .16.

In attempting to explain these findings, it must be remembered that the Healy tests were given on one occasion, and the Knox tests at least one week later, both A and B on the same day. If our results were due primarily to lack of reliability of the construction tests from day to day then scores from two construction tests given on the same day ought to show higher correlations than we found. If the lack of reliability of construction tests from day to day is not to be considered because of the generally low correlations, and so if it makes no difference whether all four tests are given on the same day or different days, then our average correlations should not turn out to be lower than the correlations of tests given on the same day, because an increase in the number of factors generally operates to raise the correlations. Other correlations between various combinations of the construction tests were made and are recorded in table 16, but the results are no more enlightening than the ones we have discussed here.
We are assuming here that the solution of each of the four construction tests involves the same abilities, not that they are of equal difficulty. We have no evidence to prove that this is the case, but we do not see how any construction tests could be devised which, though different, were apparently more similar than these. However, Healy test A and Knox test B are more similar than any other combination. Knox's test was modelled directly from Healy's and is supposed to be more difficult. A correlation between these two gives us minus .055 for time and .126 for moves. In other words, the correlation between the two is about what one would obtain between two factors having no relationship to each other at all. If this is true of Healy Construction test A and the Knox diamond-shaped frame test, we conclude that construction tests have no constant value for intelligence testing.

It will be recalled that the Thorndike Reading Scale, Alpha 2, was not repeated as the alternate scales now available had not yet been published, but we might quote Dr. McCall's statement to the effect that a high correlation was obtained between our scale and the more recently devised alternate on representative subjects.

No correlation was obtained between two trials of the Healy Pictorial Completion test II because the number of cases who were retested was small, about twenty-five, no further retesting being done because the attitude of the subjects was so different on the second testing that the repetition was more a matter of memory than anything else.

In repeating this test a week or more after the first presentation it was found that the correct pieces were again put in. Of those that were incorrect about half were the same, the other half being pictures having about the same value in scoring, so that the total score was very little altered. It was generally slightly increased, rarely lowered.
The only other test of this kind available is Healy's Pictorial Completion test A, which is so simple that almost all of our subjects would make perfect scores on it. The attitude of the subjects, when this test was offered a second time, was not good. The test appeals because it is a new situation presenting a problem in an attractive form. The second time, the newness has worn off. The usual response is, "I've done that before," or words to that effect. If the child is urged to attempt a better performance, he will often ask in a surprised tone of voice, "Didn't I do it perfectly before?" Even when one succeeds in getting a child to try again, he rarely makes any effort, but puts in at once the pieces selected before or similar ones. If he comments audibly on his performance, it runs something like this, "Oh, that one - a book was missing there; where is it? Here - why there are two - well it doesn't matter, it's a book he dropped." Occasionally one will notice that it does matter, but even this is due largely to chance, to his happening to have spied two books this time.

As a whole it was felt that what was gained by repeating this test in the way of establishing its reliability was not equivalent to what was lost in the attitude of the subjects to the tests as a whole. If repeated at the very end of the testing, this difficulty would be in part eliminated, but it was decided to omit its repetition completely.

The reliability of the Healy and Bronner learning tests was not ascertainable as no alternate series has been devised, and as no other test could be found which appeared sufficiently similar to warrant the hypothesis that it measured the same thing.

The tapping test was repeated in exactly the same form, and showed an intercorrelation of .81 with a P. E. of .022. This we may consider a satisfactory correlation, showing that the test has a high degree of reliability.
The self-correlations having been thus completed and analyzed, the next step is to consider the intercorrelations of the tests. Let us now consider the correlation of each test with the Stanford-Binet mental age. As all the tests are given crude scores regardless of age, in order to have comparable data, the mental age must be used instead of the I. Q.

TABLE XV

Test                      Stanford-Binet   (probable error)   Number of Cases
Pintner                   .439             ±.055               97
Myers                     .686             ±.037               89
Alpha 2                   .757             ±.027              106
P. C. II                  .541             ±.045              110
Learning                  .491             ±.049              105
Porteus                   .536             ±.045              110
Tapping                   .604             ±.040              112
4 Construction (Time)     .426             ±.078              107
4 Construction (moves)    .326             ±.097               83
Healy A (Time)            .410             ±.079              110
Healy A (moves)           .374             ±.092               88
Healy B (Time)            .281             ±.088              110
Healy B (moves)           .088             ±.105               88
Knox A (Time)             .046             ±.095              111
Knox A (moves)            .009             ±.099              103
Knox B (Time)             .216             ±.090              111
Knox B (moves)            .112             ±.098              103

Total number of tests correlated with Stanford-Binet: r = .5976. Woodworth's method of combining the results of several tests was used, Av. r = [formula illegible] (Woodworth: Combining the Results of Several Tests).

The first column stands for the Pearson coefficient obtained from the formula $r = \frac{\sum (x \cdot y)}{n \, \sigma_x \sigma_y}$ or, as it is usually stated, $r = \frac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}}$. The P. E. in this case means the probable divergence of the true coefficient of correlation from that obtained from a limited random selection of cases. The formula was $\sigma_{\text{true } r - \text{obt. } r} = \frac{1 - r^2}{\sqrt{n}}$. If the median deviation of the probable divergence is desired it may be obtained by multiplying the figures in the second column by .6745. For a discussion of these formulae and any other statistical methods here used, see E. L. Thorndike, Mental and Social Measurements.

It is interesting to find the Alpha test correlating most closely with the Stanford-Binet of all the tests used. It corroborates to some extent the current opinion that the Stanford-Binet is largely a test of language ability.
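The two formulae quoted above can be sketched in a few lines of Python. This is only an illustration, not the computation actually used in the study: the function names are invented here, x and y are taken as deviations from their means, and the pairing of 97 cases with the Pintner coefficient is read off Table XV.

```python
import math


def pearson_r(xs, ys):
    """Pearson coefficient: sum(xy) / sqrt(sum(x^2) * sum(y^2)),
    with x and y measured as deviations from their own means."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples of at least 2 cases")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_xx = sum(a * a for a in dx)
    sum_yy = sum(b * b for b in dy)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


def sigma_of_r(r, n):
    """Standard deviation of the divergence of the true r from the obtained r:
    (1 - r^2) / sqrt(n)."""
    return (1.0 - r * r) / math.sqrt(n)


def probable_error(r, n):
    """Median deviation (probable error): .6745 times sigma_of_r."""
    return 0.6745 * sigma_of_r(r, n)


if __name__ == "__main__":
    # Pintner row of Table XV: r = .439 with 97 cases.
    print(round(probable_error(0.439, 97), 3))
```

With r = .439 and 97 cases the probable error comes to about .055, which agrees with the Pintner entry in the table.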
The next highest correlation, that of the Myers Mental Measure, is more difficult to explain. Although intended as a non-language group intelligence test, it involved more language than any of the other tests employed. Still it would seem surprising if this were such a tremendous factor. It tends to indicate the validity of group tests, as does also Alpha, in that these two tests were given to nearly all of the subjects in groups, and yet correlate more highly with the individual Stanford-Binet than any of the other tests do, practically all of which were given individually.

One of the most surprising results is the high correlation of tapping with the Stanford-Binet. One would generally assume that the type of motor ability required in our tapping test had little to do with intelligence, especially with older subjects. Our data apparently contradict this hypothesis, and we are confronted with the necessity of explaining the data. It is known that tapping ability increases with chronological age at least up to maturity in the absence of tremors, epilepsy, chorea and other diseases affecting the co-ordinating mechanisms. When we consider that our subjects were all normal and therefore their mental ages tended to increase with their chronological ages, and that all our cases were treated together regardless of age, it at once seems plausible that we have here a spurious correlation due to increase in both scores with chronological age, rather than intelligence. We have therefore correlated tapping with the Stanford-Binet I. Q.'s, which represent intelligence regardless of age, the coefficient obtained being .069, and find that our assumption is justified.

The correlations between the construction tests and Stanford-Binet are very low, the only reasonably high coefficient being obtained with Healy A. This correlation was about the same as the composite of construction tests with Stanford-Binet.
It is interesting to note that Healy A was the only construction test which Professor Terman used in his revision, inasmuch as he considered that one only to meet the requirements sufficiently to be included.

The remaining tests, Pintner, Porteus, Learning, and P. C. II, that have been correlated with Stanford-Binet, each show a correlation very close to .50. This we consider significant in that the coefficients are sufficiently high to show that we are measuring intelligence, restricting that term to its generally used meaning with reference to mental testing. In addition the coefficients of correlation are low enough so that we may conclude that different abilities of the subjects tested are being measured, that is, the use of different tests does not result in a repeated measurement of the same abilities. Consequently the use of these tests in addition to the Stanford-Binet means the measurement of more varieties of ability than can be tested by the Stanford-Binet alone.

It remains to be determined whether Pintner, Porteus, Learning, and P. C. II all measure the same factor or whether some if not all of them can be used to distinguish special abilities which the others do not test. The answer to this inquiry lies in the results obtainable by the correlation of all of these tests with each other. These results are recorded in Table XVI and they are results so unexpected that they call for interpretation.

Many of the correlations recorded in the table show that the importance of the language factor has been overestimated in dealing with older school children. The correlations between language and non-language tests are high enough to show that the language factor need not be avoided to have a test which can be said to measure intelligence. Let us first consider the correlation between Myers and Alpha 2; the former is supposed to be a non-language test, the latter a test of understanding of sentences.
If language were an important factor it would be hard to account for a correlation of .733, the second highest obtained aside from the self-correlations. P. C. II is a performance test dealing with pictures; it is concrete where the Alpha 2 deals with abstract ideas, yet these two give a correlation of .709, likewise unquestionably high. We have stated elsewhere that some language enters into the Myers Mental Measure, and that ability to respond to the spoken word may be an exceedingly important factor. If so, why does P. C. II give a coefficient of .714 with Myers Mental Measure? If the Myers Mental Measure is a non-language test, why is the correlation of Pintner, a thoroughgoing non-language test, with Myers lower than Pintner with Alpha 2, a test involving so much language? Again, when we compare the correlation of Alpha 2 and Porteus (.701) with that of P. C. II and Porteus (.702) we are at a loss to explain the similarity in result unless we discard the idea of the importance of the language factor. For the Porteus test requires no language.

One must not overlook the importance of language as a handicap in giving tests to foreigners, etc., but where older school children are being tested it cannot be vital. For in order to succeed in the higher grammar school grades, it is essential that they have a fairly good working knowledge of the English language, and this is all that is needed to succeed with the so-called language tests.

Coming to the selection of a schedule of tests we conclude that:

1. Reading scale Alpha 2 should be included. In the first place because of its high correlations with other tests, the highest of any test with all the others, and also because of special considerations.
It must be remembered that Alpha 2 is entirely a reading and writing test and therefore one would not expect so uniformly high a correlation as exists between it and the other tests which are supposed each to be specially adapted toward bringing out certain abilities. The high correlations remind us of Binet's constant contention that intelligence, broadly speaking, can be tested by language tests. This conclusion, however, does not imply that a non-language test cannot serve a like purpose. Our intercorrelations show that there was no reason to avoid language tests inasmuch as they correlated highly with the non-language tests. This is interesting in that the tendency in devising tests is towards making them language tests. As causes of this tendency we can name first, the simplicity and lack of apparatus inherent in them and second, that the difficulty or ease of the test is far more readily regulated than in the non-language tests. In scrutinizing a test to forecast the results of its use our results seem to show that it is not necessary to dwell upon whether or not the tests involve the use of language.

2. The list of tests selected includes both Myers and Pintner. Myers correlates more highly with every other test than does Pintner, with the exception of their respective correlations with Porteus. As has been stated, both of these tests are valuable and in addition they have the merit of yielding different results.

3. The intercorrelations of P. C. II with such of the other tests as we found to be reliable were sufficiently high to make us believe that this test should be included in our schedule in spite of the uncertainty as to whether it is reliable.

4. Definite judgment upon the learning tests should be reserved for the present. Their highest correlation is under .50 and we have no evidence that they are reliable.
Before the learning tests have a right to be so called they must be shown really to measure learning ability; they must also be tested for reliability. It is a question whether learning ability of a given individual is uniform in all fields. The uniformity of learning ability cannot be assumed, for a mere assumption as to the uniformity of motor ability proved to be wrong (See Perrin, An Experimental Study of Motor Ability). If learning ability is found not to be so, combining the various tests may operate to conceal what is valuable in them.

5. The tapping test should be included, in the discretion of the examiner, not because the results can be relied upon to indicate intelligence but because giving this test, which takes only a minute, may disclose latent defects in motor control.

In an attempt to reach some definite conclusion about the construction tests we have made many correlations of different combinations. Healy A is the only test which gives a correlation as high as .40, and that with the Stanford-Binet. Nor does combining the tests raise the correlations, for the four construction tests correlated with Stanford-Binet give a result practically no higher than Healy A alone. It seemed useless to correlate the construction tests with the other tests when they gave such unsatisfactory results with each other, and with the Stanford-Binet. We have no evidence that these four tests are valuable either as intelligence tests, or for any other purpose.

The Porteus test is one of the most interesting. Since no language enters into the test, one would expect it to correlate more highly with Pintner and Myers than with the Stanford-Binet and Alpha 2. Just the opposite occurs; of the four, by far the highest correlation is with Alpha 2. The correlation of Porteus and P. C. II is practically the same. These three tests all seem to call for one kind of ability.
Is it good judgment, common sense ability, planfulness, deliberation, carefulness, foresight, good apperceptions? Probably it contains these and other similar traits. It is the difference between these tests which brings the correlations down to .70, and which causes them to correlate differently with the other tests. Alpha 2 and Stanford-Binet, both requiring language, correlate more highly than Porteus and Stanford-Binet, or P. C. II and Stanford-Binet. Some other factor causes Alpha 2 and P. C. II to correlate considerably higher with Myers than Porteus does. There are always many traits measured by every test, no matter how simple, and the emphasis on the different factors is not always the same for the same test. It varies with the group being measured: their age, sex, education, social selection, etc.

Why does Porteus get a higher correlation between his test and Binet's than we do? Partly, at least, because he tested children of all ages, but especially younger ones, whereas ours group themselves closely about a mode, and are older. We have quite a number of cases which are not completely measured by either the Stanford-Binet or the Porteus tests; that is, they could probably succeed with some harder tests if they were given the opportunity, and this lowers our correlations. The fact that many of Porteus' cases were placed higher rather than lower on his tests than on the Stanford-Binet seems to show that they tended to be poorer in language ability than in planfulness, apperceptions, or whatever one wishes to call it. Our cases on the other hand seem to find no difficulty with the language factor.

It is by comparing the results of the same group in different tests, and of different groups on the same test, that most can be learned of what the tests actually do measure. In this study we have the same group measured by many different tests.
We find our intercorrelations high in many cases, but nowhere so high that we feel that the tests are identical. However, certain similarities, such as the one just discussed between the Porteus, P. C. II, and Alpha 2, were brought to light by this method. Differences such as the striking one between Pintner and Myers have also been observed. If different groups had been used, one would be unable to draw any conclusions regarding the tests, for the groups themselves might be responsible for so many of the factors. Again, different factors of the tests are brought out by different groups, as for instance a younger and older set of children tested with the Pintner non-language survey test would give an entirely different kind of intercorrelation between the separate tests. By this method we can take account of more factors and so interpret our findings with greater accuracy.

There is no evidence that the P. C. test measures apperceptions, that the learning tests measure learning ability, that the construction tests measure ability to use concrete material. On this account, and also because each involves too many incidental, disturbing factors, none of these tests can be considered adequate measures of special abilities or disabilities. Such tests are much needed, and should be constructed so as to measure fundamental, underlying differences in ability. They must be correlated with everything of any possible importance in order to ascertain the degree to which one ability is related to all others. In studying memory we want to know how important a part it plays in reasoning, in mechanical work, etc. We must learn the significance of a good memory for every school study, and for various occupations. If different kinds of memory play important parts in different studies and vocations, this too we must find out.
It is a big task, perhaps impossible to carry out at present, but without such information we are tremendously handicapped. The taboo of "faculty" psychology has contributed to lessen activity along these lines, for if you investigate memory you are getting perilously near something obsolete. But very few would deny that there is such a thing as remembering, and all study of memory and its ramifications has yielded interesting and important results.

[TABLE XVI. Intercorrelations of the tests. The table is not legible in this copy.]

It has been more or less tacitly assumed in the past that differences in performance are due to differences in the material used rather than to underlying "faculty" differences. This was based upon findings such as those obtained when memory was tested. It was found that a good memory for logical material did not follow from a good memory for nonsense; that being able to remember visually presented facts did not necessarily indicate ability to remember what was heard. The result of these and similar observations has been the development of tests dealing with specific types of material, or, giving up the specific side entirely, tests of general intelligence.

Our data seem to indicate that real, underlying differences do exist, if we only know how to get at them. In order to prove this, it is necessary to have a test with omnibus material, all of which is designed to measure a certain type of thing.
We shall now proceed to do this.

A COMBINATION TEST FOR PLANFULNESS

The correlations in Table XVI, particularly those obtained between Porteus, Alpha 2, and P. C. II, seem to indicate the possibility of a factor, common to all and largely determining the score on each, which has nothing to do with the material employed, that is, whether a language or non-language test, or the like. We have suggested above several names for this factor: good judgment, common sense, deliberation, carefulness, foresight, good apperceptions, planfulness, persistence, prudence and mental alertness in meeting a new situation, ability to see the whole of a situation instead of reacting to the most obvious part of it. An attempt was made to investigate it more thoroughly by combining the elements of each test which seemed most specifically to measure it. The selection was made from the Porteus, Myers, Alpha 2, P. C. II, and Stanford-Binet tests.

All the tests selected would require about twenty-five minutes to perform, this being a liberal estimate based upon the time limit for each test. Alpha 2 has no definite time limit, but from the writer's experience, ten minutes would seem ample to allow for the parts of the test included in this selection. When all the individual tests had been chosen, they were divided into two sections, and a self-correlation of .763 was obtained with 80 cases. The tests in each group were:

I. Porteus: year 11 (scored 0, 1, 2); year 12 (scored 0, 1, 2, 3, 4).
   Myers: page 4, numbers 3 and 7 (scored each 0, 1).
   P. C. II: pictures 2 and 6 (scored 1 each if OK; otherwise 0).
   Alpha 2, Part II: difficulty 8, number 4 (scored 0, 1).
   Pintner: test 5, numbers 5 and 7 (scored each 0, 1).
   Pintner: test 6, picture 2, pieces 2 and 1 (scored each 0, 1).
   Pintner: test 6, picture 3, pieces 4 and 1 (scored each 0, 1).

II. Porteus: year 10 (scored 0, 1, 2); year 14 (scored 0, 1, 2, 3, 4).
   Myers: page 4, numbers 5 and 10 (scored each 0, 1).
   P. C. II: pictures 7 and 8 (scored 1 each if OK; otherwise 0).
   Alpha 2, Part II: difficulty 8, number 1 (scored 0, 1).
   Pintner: test 5, number 6 (scored 0, 1).
   Pintner: test 6, picture 2, pieces 4 and 3 (scored each 0, 1).
   Pintner: test 6, picture 3, pieces 2 and 3 (scored each 0, 1).
   Stanford-Binet: XIV years, number 6 (scored 0, 1).

The Porteus tests were chosen because they were devised to measure this very thing. The fact that only one type of material, mazes, was included was considered by Porteus one of the outstanding advantages of his test. We feel that this is a disadvantage since some children might have a disability for working with this kind of material although possessed of common sense, foresight, etc. With omnibus material this special factor is overcome. The choice of the four most difficult tests was largely a matter of the distribution of the subjects. Too many would have made perfect records on the easier tests.

The selection from Myers Mental Measure was based largely upon resistance to suggestion. In each case four pictures with some element in common must be chosen from eight possible ones and underlined. These four could not be too difficult or our subjects would all score 0; if they were too easy we would have no reason to believe that this characteristic pertained to them. Number 3 is the selection of four toys, a tricycle, top, kite and rocking horse, with a soldier as the confusing picture. In number 5, four items made of iron must be chosen: a stove, a dagger or sword, a train, and a lock. This has several confusing suggestions. There is a broom which might be associated with the stove, and two animals which might be connected with the train as they all are capable of locomotion. Number 7 consists of an insect, a broom, a bird, a table, a butterfly, an aeroplane, a goat and a cow. The four things which can travel in air are to be underlined.
The two animals prove confusing to many children. In number 10 the subject is to select four articles of wood, two trees, a barrel and a table, with a snake, a camel, a cannon, and a bird to be omitted. Here also the three animals receive considerable attention, the hasty child not noticing that the fourth is lacking, or the snake is overlooked, the two remaining animals and two trees being classed together as objects possessing life. There seems to be some suggestion in each of these pictures, and it is certainly true that a careful, deliberate performance by a subject who takes in the whole situation and responds to it will give far better results than a hasty, careless one.

The pictures from P. C. II are those in which there are several obvious possibilities. A hasty, careless selection will hit upon the first possible one, rather than searching further for the exactly correct one. All correct pieces were checked by asking the subject why that particular one had been chosen, and if it was put in by chance, no credit was given. The partial credits given by Healy were omitted, the picture being scored either as perfect or as a failure. This was necessary in order to eliminate the other possible factors which enter into solving the test partly. For instance, in the second picture, where a book is missing, it is not sufficient to put in any book, pencil case or lunch box, but by following up persistently all the clues, the one and only correct red book can be placed in the space with certainty.

From the Thorndike Alpha 2 reading scale, questions were selected which had been answered by a large number of children. Question 1 requires a fairly careful study of the paragraph in order to find just what it is that seems true at first but is really false. The question is a little clumsily put, certainly not direct and to the point, which is an advantage for our purposes.
Question 4 is not a reading scale problem proper, but necessitates close attention to several directions. In two rows of digits the subject must underline every five that comes just after a two, unless the two comes just after a nine. If that is the case, he must draw a line under the next figure after the five. The last few lines of the first page of the Myers Mental Measure are similar to this, but the Alpha 2 was given to a larger number of cases, there was no time limit, and less possibility of copying, so it was given the preference, as being more accurate.

Numbers 5, 6, and 7 from Pintner test 5 are all similar in nature. Given a drawing, the problem is to draw it in a reversed position, with two lines of the second position given on which to construct the rest. This seems like a rather special ability, but Pintner gives each drawing considerable weight in his total score, and persistence and planfulness are certainly essential for a good performance.

Pintner test 6 consists of parts of pictures presented in a disarranged order. Each part is numbered and blank spaces are provided in which the subject is to place the numbers of the parts in the order which would give a perfect ensemble. Here again planfulness, patience, and foresight are needed, and on the whole the subject who possesses them to the greatest degree will be the most successful.

Finally one test was selected from the Stanford-Binet scale, namely the reversed clock hands of year XIV. If two out of three were correct a score of one was given, if less no credit at all. This test seemed to require the same kind of ability as many of the other tests included, and was therefore added. Some of the other Stanford-Binet series might have been used also, but those which seemed desirable came too high or too low in the scale so that the distribution for our subjects would not be satisfactory.
The correlation of .763 obtained between the two parts is fairly high when it is remembered that the highest score on each section can only be 17; also that the whole series of both parts would only take half an hour to give. As to reliability it is a noteworthy conclusion that this self-correlation is the highest one obtained with any non-identical material.

A correlation of the composite tests with any of the tests which are included would probably give a high coefficient difficult to interpret because of the varying amount of each included in the composites, and a low correlation with learning tests, construction tests, or tapping could hardly be considered strong evidence in favor of our new grouping. But the correlation with Stanford-Binet seemed worth finding, and when worked out yielded a coefficient of .537. This indicates that our combination test is comparable with the whole series of tests from which it was compiled. We have, however, no criterion to prove that it actually measures the trait which we presuppose it does. But this same criticism applies to all the tests which are supposed to measure specific factors.

Our new test combination of old material is certainly as good as the tests from which it originated; we think it is better, because it gives evidence of measuring one trait, or group of traits, with a variety of materials, whereas all the others measure many kinds of traits with identical or similar material. That is, the classification and material preparatory to the formation of a test has generally heretofore been along the lines of the material employed, such as form boards, etc., whereas the combination test being discussed presents the results obtained from forming a test directed toward planfulness, or other ability.
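The arithmetic behind the two sections can be checked with a short sketch. The dictionaries below simply restate the maximum credit for each element of Sections I and II as listed earlier in this chapter; the variable and function names are invented for illustration. The Spearman-Brown step is not applied anywhere in this study. It is added here only as an assumption, treating the two sections as parallel halves, to show what the self-correlation of .763 would imply for the full-length combination under that assumption.

```python
# Maximum credit obtainable on each element, as listed for the two sections.
SECTION_I = {
    "Porteus, year 11": 2,
    "Porteus, year 12": 4,
    "Myers, page 4, numbers 3 and 7": 2,
    "P. C. II, pictures 2 and 6": 2,
    "Alpha 2, Part II, difficulty 8, number 4": 1,
    "Pintner, test 5, numbers 5 and 7": 2,
    "Pintner, test 6, picture 2, pieces 2 and 1": 2,
    "Pintner, test 6, picture 3, pieces 4 and 1": 2,
}

SECTION_II = {
    "Porteus, year 10": 2,
    "Porteus, year 14": 4,
    "Myers, page 4, numbers 5 and 10": 2,
    "P. C. II, pictures 7 and 8": 2,
    "Alpha 2, Part II, difficulty 8, number 1": 1,
    "Pintner, test 5, number 6": 1,
    "Pintner, test 6, picture 2, pieces 4 and 3": 2,
    "Pintner, test 6, picture 3, pieces 2 and 3": 2,
    "Stanford-Binet, year XIV, number 6": 1,
}


def max_score(section):
    """Highest total obtainable on one section."""
    return sum(section.values())


def spearman_brown(r_half, k=2):
    """Spearman-Brown estimate of the reliability of a test k times as long
    as the parts whose intercorrelation is r_half (assumes parallel parts)."""
    return k * r_half / (1.0 + (k - 1) * r_half)


if __name__ == "__main__":
    print(max_score(SECTION_I), max_score(SECTION_II))
    print(round(spearman_brown(0.763), 3))
```

Both sections total 17, as stated in the text. Under the parallel-halves assumption the estimate for the two sections taken together is about .87; this figure is an illustration only and is not a result reported by the study.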
CONCLUSION

It is proposed to set forth the practical results of this study, to show the positive information that has been ascertained and also to show, from the experience gathered in the course of obtaining such information, what further investigations should be made, with what purpose, and what methods may lead to success. This study has reached some positive results and has disclosed other perhaps more valuable ones in the same field.

In entering upon this study it was believed that the results of the method that has been pursued would justify the conclusion that the Stanford-Binet series can be used as a test of general intelligence and that certain other tests used as auxiliaries would make apparent and give a measure of special abilities not individually measured by the Stanford-Binet. It was expected that the various tests would give reasonably high correlations with the Stanford-Binet and rather low correlations with each other, thus on the one hand establishing the reliability of the tests used, and on the other hand, the diversity of the abilities that were subjected to measurement.

These results were anticipated because care was used in selecting the tests to take those which had an approved authorship, an extended use, a definite purpose, and a general reputation of success in the field they purported to cover. That is, the various units had each been shown apparently to be satisfactory and on these a priori grounds it was thought that properly selected units used in conjunction would result in a reliable schedule.

Had the results of the correlations been in harmony with this anticipated situation, we might properly have pointed to this study as a demonstration of the process by which schedules of tests for children should be composed.

Looking upon our results as they have been reported, the fact is obvious that there is no such easy manner in which to arrive at reliable schedules of tests.
Unexpectedly low correlations were obtained in some situations where the indicated results should have been high, and vice versa, and while our positive purpose therefore met with disappointing obstacles, a study of the figures as we have them led to other worthwhile conclusions.

Drawing upon the results of the correlations, it can be stated with assurance that it will not be well to take tests upon which a high face value has been placed when they were used without being effectively valued by comparison, and to combine a number of them in the expectation of using the combination to get reliable information as to the general intelligence and the special abilities of normal children.

One of the best examples which we can show, as a result of this study, of the impropriety of such procedure is that the type of material used does not govern the abilities tested. We obtained a higher correlation between a language and a non-language test than between two language tests or two non-language tests; similar examples can be drawn from the correlations listed above respecting other characteristics of various tests.

Insofar therefore as authors of tests have relied upon the material as a quality that would single out and measure a certain one of many abilities, it seems clear that individual tests miss their purpose. However, the correlations did seem to show that something definite was being tested, so that if our purpose of finding a schedule of tests at once sufficient to measure both general and special abilities was disappointed, at least the schedule we used can be relied upon for general abilities, and such a schedule is more reliable than the Stanford-Binet alone.

The components of this schedule have been previously listed and it only remains to state what individual matters of interest relating to each were made clear in the course of the study which was directed to larger purposes.
It was a matter of actual demonstration herein that all of the construction tests used are unreliable, this conclusion disproving the previously held opinion, based upon empirical considerations, to the effect that they reliably measure ability to handle concrete material.

Persons having occasion to apply mental tests have too frequently overlooked the matter of how far the test can be relied upon. This is an important matter and consequently it should be of some interest to note that the reliability of the Stanford-Binet, Pintner non-language group test, Thorndike reading scale Alpha 2, Porteus Maze Test, and tapping test has been established, whereas the Myers Mental Measure, the Healy Pictorial Completion test II, the Healy-Bronner learning tests and the crossline tests are not yet definitely shown to be reliable.

Care should also be observed in interpreting the results of correlations, for the mere fact of high correlation is only generally and not conclusively proof of reliability. There is the possibility that factors causing unreliability have been hidden; thus, in the tapping test, the high correlation with Stanford-Binet was deceptive owing to the fact that the scores on both increased with the age of the subjects. Other specific remarks relating to individual tests are contained in the results.

There remains to state what considerations we have found to have a probable value as to future work in this field. If we found on the one hand that the type of material used in a test does not govern the ability tested, on the other hand there are some indications that to test individual abilities the test should have a variety of material. So far the elements of a desired test can be stated, but the further necessity of finding just what material is suitable can only be determined by practical work consisting of correlation with outside criteria and with any other measures of claimed effectiveness in the field in question.
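The caution about the tapping test, where both scores rise with the age of the subjects, can be illustrated with invented data. The sketch below is not the procedure followed in this study, which correlated tapping with the I. Q. instead; it substitutes the partial-correlation formula as a stand-in for "holding age constant." All names, the age range, and the noise levels are assumptions made for the illustration, and no figure it produces is a finding of the study.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson coefficient computed from deviations about the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    num = sum(a * b for a, b in zip(dx, dy))
    return num / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def partial_r(r_ab, r_ac, r_bc):
    """Correlation of a and b with c held constant."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


def simulate(n=2000, seed=0):
    """Two scores that each depend on age and on independent chance factors,
    but not on each other (hypothetical data)."""
    rng = random.Random(seed)
    ages = [rng.uniform(10.0, 16.0) for _ in range(n)]
    tapping = [a + rng.gauss(0.0, 1.0) for a in ages]
    mental_age = [a + rng.gauss(0.0, 1.0) for a in ages]
    r_raw = pearson_r(tapping, mental_age)
    r_held = partial_r(r_raw,
                       pearson_r(tapping, ages),
                       pearson_r(mental_age, ages))
    return r_raw, r_held


if __name__ == "__main__":
    raw, held = simulate()
    print("raw correlation:", round(raw, 2))
    print("with age held constant:", round(held, 2))
```

In data built this way the raw correlation is substantial while the correlation with age held constant stays near zero, which is the same pattern as the drop from .604 to .069 reported for tapping.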
As an experimental example, for the confines of this study would allow no more extended investigation, various parts of a number of the tests were united in a combination test intended to secure a measure of planfulness. The resulting correlations indicated success in this attempt. A similar or even greater measure of success may follow further combinations aimed at the measurement of other abilities.

It may also be stated as having been illustrated in the course of this study that the supposed merit of various mental tests based upon various insufficient or unscientific criteria, such as mere hypothesis, or even practical results, if relied upon, may lead to misleading or dangerous conclusions, and that before one takes the responsibility of giving advice or of taking action with respect to information gained from the application of mental tests, there should be available the assurance that proper comparative tests and correlations have verified the supposed propriety of relying upon the results.

VITA

The author of this dissertation was born in New York, August 25, 1898. Secondary education was at Far Rockaway High School, taking highest honors, and receiving Regents Scholarship for College. Vassar College, 1915-1917; Barnard College, Columbia University, 1917-1919; B. A. Degree Columbia University, 1919, Honors in Psychology; 1918, research work for New Jersey State Institution for Feeble Minded; 1919-1920, Fellowship at Judge Baker Foundation, Boston, Assistant Psychologist; Columbia University, 1920-1922, Post Graduate Work in Psychology.