INTRODUCTION TO MATHEMATICAL STATISTICS

BY
CARL J. WEST, Ph.D.,
ASSISTANT PROFESSOR OF MATHEMATICS
OHIO STATE UNIVERSITY

COLUMBUS
R. G. ADAMS AND COMPANY
1918

COPYRIGHT, 1918
BY CARL J. WEST

PRESS OF THE F. J. HEER PRINTING CO.
COLUMBUS, OHIO

PREFACE.

It is the aim of this book to present certain topics of elementary statistical theory which have been found useful and workable. The statement would seem warranted that no more than the very simplest methods should be used by one who has no knowledge of the principles underlying the methods. Busy though the scientist may be, he owes it to the science and to the persons who may accept his results to have some familiarity with his tools.

The blind application of formulas in statistics has been made possible by the convenient manuals that have appeared and has been encouraged by the fact that the theory has been so surrounded by intricate and involved mathematics that it was only by an extended research that a knowledge of the theory could be obtained. There is no real reason why the theory of statistical methods should remain in obscurity. The necessary mathematics is largely elementary arithmetic and except in a few cases there is no need for higher mathematics. This book presupposes a reasonable familiarity with elementary mathematics only.

Because of the desire to eliminate higher mathematics from the body of the book the discussion of the theory of the Generalized Frequency Curves of Pearson has been deferred to Appendix I. For the same reason a discussion of the promising method of variate differences is omitted, as is the mathematical theory of random selection.

While it is hoped that the statistical data of this book may be of interest in themselves they have been selected solely with reference to their usefulness in illustrating the theory. For this reason all examples and exercises have to do with very simple data. The author will appreciate notice of such numerical and other inaccuracies as may be found.

The idea is emphasized that a formula or method to be of practical and trustworthy value to a statistician must be so simple and direct that the final results can be interpreted in terms of the original conditions or the given data. To illustrate, if the arithmetic mean is ten per cent larger in one distribution than in another what difference does this variation indicate in the forms of the distributions or in the values of the two series of measurements? If one correlation ratio is 0.54 and a second 0.59 how much more closely related are the attributes in the second than in the first? It must always be remembered that mathematics is but a tool to be used when the desired results can be more efficiently attained by its use, and that a formula is nothing more than a statement in mathematical language of a method of computation already thought out and understood. The difficulties that may arise in this subject are not primarily mathematical. They are essentially a part of the necessarily difficult task of analyzing a statistical distribution.

The preparation of a book on mathematical statistics to appeal to scientific workers in fields ordinarily considered to be non-mathematical is essentially a matter of experimentation. It is the hope of the author that this book may stimulate interest in the methods of presenting statistical theory and in the more inclusive problem of making mathematical theory more widely available. Any suggestions or criticism of this presentation will be appreciated.

The Bibliography of Appendix II is inserted as a guide to advanced reading in the subject of mathematical statistics; the contributions of Prof. Pearson are to be noted especially.

It seems hardly necessary to refer to the debt which anyone who works in statistical theory must owe to Professor Karl Pearson. Because of his "Tables for Statisticians and Biometricians" the formulas of Appendix I are not given in more detail.

Professor James McMahon has given most generously of his time and interest. Whatever assistance this book may afford to the practical worker in statistics is in a large measure due to the influence of Professor Walter F. Willcox, whose critical insight into the limitations and the possibilities of statistical methods together with the originality and practical initiative which permeate his research and instructional work place all his students under obligations to him.

CONTENTS.

CHAPTER I. CURVE PLOTTING
Plotting the Data. General Directions for the Laying-off of Scales. Connecting the Plotted Points. Directions for Plotting Curves. The Title of a Diagram. More than one Curve on the Same Diagram. Coordinates. Logarithmic Curves. Cumulative Curves.

CHAPTER II. CURVE PLOTTING (Continued)
Interpolation. The Smoothing of a Curve. Smoothing by Inspection. The Preservation of Areas. The Adjusted Data: Interpolation. Test of a Graduation. Determining the General Trend of the Data. Periodic Data.

CHAPTER III. FREQUENCY CURVES
Definitions. The Construction of a Frequency Distribution. Plotting a Frequency Distribution. Smoothing the Frequency Distribution. Use of the Frequency Curve. Errors in Representative Data.

CHAPTER IV. AVERAGES
The Arithmetic Mean. Statistical Properties of the Arithmetic Mean. Theorem on the Sum of Deviations from the Mean. The Weighted Arithmetic Mean. Adjustment or Graduation Formulas. The Geometric Mean. Properties of the Geometric Mean. The Median. Quartiles. Deciles. Statistical Properties of the Median. The Probable Deviation. The Mode. Statistical Properties of the Mode.

CHAPTER V. THE FORM OF THE DISTRIBUTION
Dispersion. Measures of Dispersion. Mean Deviation. Proof that Mean Deviation Smallest about the Mean. Statistical Properties of the Mean Deviation. The Mean Squared Deviation. Short Rule for the Mean Squared Deviation. The Standard Deviation. Properties of the Standard Deviation.
The Coefficient of Variability. The Quartiles as Measures of Dispersion. Formula for the Probable Deviation. Probable Deviation of the Arithmetic Mean. Probable Deviation of the Standard Deviation. Statistical Significance of the Probable Deviation. The Deciles as Measures of Dispersion. Symmetrical and Asymmetrical Distributions. The Position of the Averages and Asymmetry. Skewness. Measures of Skewness.

CHAPTER VI. THE NORMAL PROBABILITY CURVE
The Equation of a Frequency Curve. Statistical Theory of the Normal Curve. The Equation of the Normal Curve. The Graph of the Normal Equation. Areas under the Normal Curve. Preliminary Determination of Normality. Probable Deviation in a Normal Distribution.

CHAPTER VII. THE CORRELATION TABLE
An Illustration. The Construction of a Correlation Table. Definitions and Symbols. Correlation.

CHAPTER VIII. THE CORRELATION RATIO
The Mean as Representative of the Array. Regression Curves. Coordinate Axis. Correlation and Regression Curves. Mean Squared Deviation of the Means of the Array. The Correlation Ratio. Two Values of the Correlation Ratio for each Table. Limiting Values of the Correlation Table. Probable Deviation of the Correlation Ratio. Spurious Correlation.

CHAPTER IX. THE COEFFICIENT OF CORRELATION
Linear Regression. The Equations of the Lines of Regression. The Coefficient of Correlation. Computation of r. The Relation between r and η. Limiting Values for r. Statistical Properties of the Coefficient of Correlation. Test for Linearity of Regression. Probable Deviations.

CHAPTER X. CORRELATION FROM RANKS
Rank in a Series. Theorems. Ties in Rank. The Bracket Rank Method. The Mid-Rank Method. Probable Deviation of the Rank Coefficient. Perfect Rank Correlation. Uncorrelated Data. Correction Formula for the Rank Coefficient. Corresponding Values of [symbols illegible]. Probable Deviation of r_xy from Ranks. Theorems. Accuracy of the Coefficient r_xy When Computed from Ranks.

CHAPTER XI. THE MOMENTS OF A DISTRIBUTION
Introduction. Transformation Formulas. Summation Methods. Correction Formulas for the Moments. Theorems. Summations. The Moments and the Equation of the Smoothed Curve.

CHAPTER XII. FURTHER THEORY OF CORRELATION
A Second Concept of Correlation. Derivation of the Equations of the Regression Lines. The Relation between r and η. The Coefficient r for non-linear Regression. The Most Probable Value of a Characteristic. Theorems. Correlation of Indices.

CHAPTER XIII. THE METHOD OF CONTINGENCY
The Mean Squared Contingency. Properties of φ. Non-Quantitative Characteristics. The Four-fold and the Nine-fold Tables. Theorems.

Appendix I. The Frequency Curves of Pearson
Appendix II. Bibliography

CHAPTER I.

CURVE PLOTTING.

Plotting the Data. Let us plot the following data* of the monthly precipitation at Columbus for the year 1916:

    January     5.0 inches     May       4.8 inches     September   1.5 inches
    February    1.5 inches     June      3.5 inches     October     1.8 inches
    March       4.9 inches     July      0.7 inches     November    1.6 inches
    April       2.3 inches     August    3.2 inches     December    3.6 inches

A horizontal straight line is first drawn and at equal distances on this line twelve points are located, one for each month. On a vertical line erected at the point corresponding to the month of January equal intervals are laid off, one for each inch of precipitation; and these intervals are subdivided into tenths. The two series of points are called the scales. It is usual to designate the horizontal and the vertical scale lines by O X and O Y respectively, as in Figure I.

FIG. I. Monthly Precipitation at the Columbus Station for the year 1916.

The January precipitation is 5.0 inches. Place a dot above the January, or beginning point, at a height corresponding to 5.0 inches on the vertical scale.
The next point is directly above the second or February point at a distance corresponding to 1.5 inches. Continuing in this way we locate a point for each month; the data is then said to be plotted or pictured point by point.

* Annual Meteorological Summary, U. S. Weather Bureau, Columbus, Ohio. 1917.

Exercises.

1. Plot the following March precipitation data.*

    1879   3.8     1889   0.7     1899   4.7     1909   2.7
    1880   2.4     1890   5.6     1900   2.6     1910   3
    1881   4.0     1891   4.6     1901   1.8     1911   2.4
    1882   4.8     1892   2.2     1902   2.6     1912   4.6
    1883   3.2     1893   1.9     1903   4.1     1913   8.1
    1884   3.6     1894   1.8     1904   4.9     1914   2.5
    1885   0.5     1895   1.2     1905   1.9     1915   1.2
    1886   3.9     1896   3.0     1906   4.6     1916   4.9
    1887   2.6     1897   5.5     1907   5.2
    1888   3.8     1898   7.0     1908   6.0

2. Plot the following population data for the United States:

    1790    3,929,214
    1800    5,308,483
    1810    7,239,881
    1820    9,638,453
    1830   12,866,020
    1840   17,069,453
    1850   23,191,876
    1860   31,443,321
    1870   38,558,371
    1880   50,155,783
    1890   62,947,714
    1900   75,994,575
    1910   91,872,266

In plotting this data take the numbers to the nearest million.

General Directions for the Laying off of Scales. The object of any graphic representation of statistical data is to present a vivid picture and therefore a diagram too small or too large, or too wide or too narrow will not accomplish this purpose as efficiently as will a correctly proportioned diagram. This means that the widths of the horizontal and the vertical scale intervals must be carefully chosen in order to give the complete diagram the proper proportions.

In determining the widths of the intervals account must be taken of the nature of the statistical material. If the data is so inaccurate, for instance, that the measurements can be determined only to the nearest million it would be improper to divide the scale into intervals corresponding to thousands.
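
The direction given under Exercise 2, to take the numbers to the nearest million, may be sketched in a few lines of Python. The function name to_nearest_million and the choice to round an exact half upward are assumptions made for this illustration; the text asks only for the nearest million.

```python
# Census counts from Exercise 2 (year -> population).
census = {
    1790: 3_929_214, 1800: 5_308_483, 1810: 7_239_881, 1820: 9_638_453,
    1830: 12_866_020, 1840: 17_069_453, 1850: 23_191_876, 1860: 31_443_321,
    1870: 38_558_371, 1880: 50_155_783, 1890: 62_947_714, 1900: 75_994_575,
    1910: 91_872_266,
}


def to_nearest_million(count: int) -> int:
    """Round a count to the nearest million (hypothetical helper; a half rounds up)."""
    return (count + 500_000) // 1_000_000


# The values that would actually be laid off on the vertical scale.
plotted = {year: to_nearest_million(count) for year, count in census.items()}

for year, millions in plotted.items():
    print(year, millions)
```

Working in whole millions keeps the plotted figures no finer than the scale can show, which is the point of the direction.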
The wealth of the country and the value of manufactured articles are examples of statistics which do not admit of close subdivision. It is useless to have the scale intervals finer than the smallest difference which the eye can conveniently distinguish on the diagram. This often means, even in the case of quite accurate material, that the figures of the data must be cut back; for instance in plotting population data for the United States one million may be the smallest numerical difference that can be pictured on an ordinary sized diagram.

Usually, as in Figure II, horizontal and vertical lines, called coordinate lines, are drawn to assist in carrying the divisions of the scales across the diagram. Care must be taken that these lines are lightly drawn and are not more numerous than is necessary.

Connecting the Points. The eye is assisted in passing across a diagram if the plotted points are connected by a curve. The curve may be either a series of broken straight lines joining the points or a continuous curve passing thru each point without sharp angles or abrupt changes in direction. Of the two methods the continuous curve is usually to be preferred because of the better appearance which it presents. In Figure II the points are connected by straight lines and in Figure III a continuous curve is drawn.

Exercises.

3. Plot the curve of the 1916 rainfall at Columbus from the data of Exercise 1.

FIG. II. The Plotted Points of Monthly Temperatures connected by straight lines.

FIG. III. The Points of Fig. II connected by a continuous curve.

4. Plot the population curve from the data of Exercise 2.

Directions for Plotting Curves.

1. The general arrangement of a diagram should be from left to right and from bottom to top.
2. Figures for the scales of a diagram should ordinarily be placed at the left and along the bottom.
3. Whenever practicable, the vertical scale should be so chosen that the zero line will appear on the diagram. When this is not done it is well to indicate that fact by a break in the diagram.
4. The zero lines must be sharply distinguished from the other coordinate lines of the diagram.
5. The curve must be carefully distinguished from the coordinate lines.
6. The data should accompany the diagram either in the form of a tabular statement or placed directly on the diagram. The latter method of presenting the original data can sometimes be effectively used, especially when the number of items is not large.

Underlying all rules for the construction of statistical diagrams is the general direction: The diagram must be so arranged as to present the data most effectively. Because of the great diversity of statistical material and of the wide variety of purposes for which data may be collected and presented it is not possible to lay down specific rules which are to be followed in every case. Whenever the vividness and accuracy of the statistical picture is not sacrificed by so doing, the conventional and generally accepted ways should be followed.

Exercises.

5. Plot the following data of annual precipitation.*

    1879   31.?     1889   28.5     1899   28.5     1909   36.6
    1880   44.7     1890   50.7     1900   30.3     1910   34.8
    1881   47       1891   42.1     1901   26.5     1911   43.4
    1882   51.3     1892   33.5     1902   34.2     1912   29.6
    1883   48.9     1893   38.1     1903   28.1     1913   40.9
    1884   31.0     1894   29.5     1904   31.5     1914   31.2
    1885   43.3     1895   30.7     1905   35.1     1915   39.9
    1886   42.4     1896   40.5     1906   33.7     1916   34.4
    1887   30.3     1897   41.2     1907   37.6
    1888   35.1     1898   41.3     1908   30.1

Since the lowest number of inches is 26.5 it is better to make a break in the vertical scale, starting the working scale with, say, 25 inches.

* Report of Columbus Station, U. S. Weather Bureau.

6. Plot the following data of mean monthly temperatures.*

    1879   53.1     1889   52.2     1899   53.2     1909   52
    1880   53.6     1890   53.2     1900   53.8     1910   51.7
    1881   54.2     1891   52.6     1901   51.8     1911   53.8
    1882   53.4     1892   51.3     1902   52.1     1912   50.8
    1883   51.8     1893   51.2     1903   52       1913   53.5
    1884   52.5     1894   53.3     1904   50.2     1914   52
    1885   49.1     1895   51.6     1905   51.5     1915   51.8
    1886   50.3     1896   53       1906   52.7     1916   52.0
    1887   52.5     1897   52       1907   50.8
    1888   51.0     1898   53.6     1908   53.5

7. Plot the curve of Top Beef Cattle Prices from the following data:**

    1891   7.15     1898   6.25     1905   7.00     1912   11.25
    1892   7.00     1899   8.25     1906   7.60     1913   10.25
    1893   6.75     1900   7.50     1907   8.00     1914   11.40
    1894   6.40     1901   8.00     1908   8.40     1915   11.60
    1895   6.60     1902   ?.00     1909   ?.50     1916   13.00
    1896   6.50     1903   6.85     1910   8.85
    1897   6.00     1904   7.65     1911   9.35

8. From the data of page 25 plot the 1916 beef cattle prices.

9. From the data of page 25 plot the 1895 beef cattle prices.

The Title of a Diagram. Each diagram must be provided with a brief and concise and yet accurate and comprehensive title. The title must cover all of the data and not merely a certain section of it and it must do this without being of undue length. A careful study of examples of titles is especially helpful in acquiring a notion of what constitutes a proper title.

All headings of columns must be clear and definite. The units of measurement of a scale must always be given; thus, "Precipitation in inches," "Temperature in degrees."

Titles and headings have a better appearance when made in Roman characters than when made in script. In general the size of type in each heading or sub-title should correspond in size and prominence to its respective importance.
Unless the lettering is skilfully done by hand it is better to use a typewriter even tho different sizes of letters cannot be secured by its use.

* Report of Columbus Station, U. S. Weather Bureau.
** Chicago Live Stock World, January 2, 1917.

Exercises.

10. Study the titles and headings of the diagrams and tables of Vol. V, Report of the United States Census, 1910.

11. Study the titles shown in "Graphic Methods for Presenting Facts" by Willard C. Brinton.†

12. Study the titles and headings of the current issue of the Monthly Crop Reporter, Department of Agriculture.

In each of the following exercises construct a complete statistical diagram with the curve carefully drawn and an appropriate title designed for each.

13. The Land area of the United States exclusive of Outlying possessions from Table 18, Vol. I, Report of the United States Census, 1910.

14. The population of Ohio from Table 10, same report.

15. Comparative Values of Inside Lots of Different Depths according to the Lindsay-Bernard system of valuation.

The Lindsay-Bernard and Somers Valuation Schedule.*

    Depth.   Lindsay-Bernard.   Somers.        Depth.   Lindsay-Bernard.   Somers.
      5          $9              $14.35          85         $82.0            $93.33
     10          15               25.00          90          84.2             95.60
     15          21               32.22          95          86.2             97.85
     20          27               41.00         100          88              100.00
     25          33               47.90         105          89.6            102.08
     30          38.5             54.00         110          91.1            104.00
     35          44               59.20         115          92.5            105.78
     40          49               64.00         120          93.8            107.50
     45          54               68.45         125          95              109.50
     50          58.5             72.50         130          96.1            110.50
     55          63               76.20         135          97.2            111.80
     60          67               79.50         140          98.2            113.00
     65          70.6             82.61         145          99.2            114.50
     70          73.9             85.60         150         100              115.00
     75          76.9             88.30         175         103              119.14
     80          79.6             90.90         200         105              122.00

16. Comparative Values of Inside Lots of Different Depths according to the Somers system.

17. The accumulated value of $1 at 10% compound interest:

    Year.     1     2     3     4     5     6     7     8     9     10
    Amount.   1.00  1.10  1.21  1.33  1.46  1.61  1.77  1.95  2.14  2.36

† The Engineering Magazine, 1915, N. Y.
* The National Real Estate Journal, May, 1914.

18. The Average Yield per Acre for Wheat in the United States since 1866; Yearbook, Dep't of Agriculture.

19. Average Farm Price per bushel of Wheat in the United States since 1866; Yearbook, Dep't of Agriculture.

20. Per cent of Wheat Crop Exported since 1866; Yearbook, Dep't of Agriculture.

21. Total Production of Wheat to nearest 10 million bushels in the United States since 1866; Yearbook, Dep't of Agriculture.

22. Substitute the word Corn for Wheat in Exercises 18 to 21 and construct the curves.

23. Bank Clearings of the United States, excluding N. Y.

Bank Clearings of U. S. excluding N. Y. (in millions).

    1883   $14,209     1894   $21,298     1905   $50,087
    1884    12,919     1895    23,507     1906    55,327
    1885    13,170     1896    22,304     1907    57,994
    1886    15,513     1897    23,895     1908    53,133
    1887    17,566     1898    26,959     1909    62,249
    1888    18,397     1899    33,416     1910    66,821
    1889    20,280     1900    33,771     1911    67,857
    1890    23,370     1901    39,152     1912    73,209
    1891    23,198     1902    41,695     1913    75,181
    1892    25,660     1903    43,239     1914    72,225
    1893    23,049     1904    43,972

24. Per capita Imports of U. S.:

    1860   $11.25     1879   $10.52     1897   $10.32
    1861     9.02     1880    13.88     1898     8.66
    1862     5.79     1881    13.06     1899    10.68
    1863     7.29     1882    14.36     1900    10.86
    1864     9.30     1883    12.81     1901    11.34
    1865     6.87     1884    11.48     1902    12.30
    1866    12.26     1885    10.49     1903    12.42
    1867    10.23     1886    11.57     1904    12.7?
    1868     9.94     1887    12.09     1905    14.24
    1869    11.60     1888    12.11     1906    15.69
    1870    11.97     1889    12.58     1907    16.29
    1871    14.47     1890    13.15     1908    12.54
    1872    16.15     1891    12.96     1909    16.28
    1873    14.27     1892    12.91     1910    10.94
    1874    13.13     1893    11.68     1911    16.32
    1875    11.43     1894     9.97     1912    19.04
    1876     9.47     1895    11.60     1913    18.47
    1877    10.37     1896     9.66     1914    18.14
    1878     9.07

Note that the data of the two preceding exercises shows a decided periodicity or wave-like nature.

More than One Curve on the Same Diagram. For the purpose of comparing different curves it is often convenient to plot two or more curves on the same diagram. For instance, simultaneous variations in the prices of wheat and corn can be observed to good advantage, when the two curves are brot together on the same diagram and constructed to the same scales. The chief disadvantage of this method of comparing curves lies in the resulting complexity of the diagram.

If the diagrams are constructed on thin paper and the lettering and curves are made heavy the different curves when made on separate sheets can be readily compared by adjusting one sheet of paper above the other.

Exercises.

25. Compare the rainfall curve of Exercise 5 with the temperature curve of Exercise 6. To what extent do the two curves vary in the same directions? What conclusions can be drawn as to the tendency for the amount of rainfall to depend on the temperature?

26. Compare the two systems of real estate valuation of Exercises 15 and 16.

27. Give a comparative interpretation of the curves of Exercises 18 and 19. Why should they not be expected to follow exactly the same general course?

28. Discuss as in Exercise 27 curves of prices and yield per acre of corn.

29. Compare the curves of Exercises 21 and 23.

Coordinates. It is convenient to have a standardized notation for the horizontal and vertical scales. The horizontal line is denoted by O-X and called the axis of abscissas or simply the X-axis. The vertical line is denoted by O-Y and called the axis of ordinates or the Y-axis. The point where the two lines meet is the origin of coordinates. Distances along the X-axis are spoken of as x distances or x coordinates, and those along the Y-axis as y distances, or y coordinates. Thus in the precipitation data of page 7, the origin is at January, 1879, and the values of X differ by intervals of one month, while the unit interval for Y is one inch.

Logarithmic Curves. Whenever the data seems to exhibit a uniform rate of increase or whenever it is desired to study the relative changes rather than the actual changes in the data, a logarithmic curve may be of service.* A logarithmic curve is obtained by taking the logarithms of the measurements and using these logarithms as vertical distances or ordinates. Since multiplying two numbers adds their logarithms, a constant ratio or rate will appear in the logarithmic diagram as a constant addition. Hence if there is a constant rate in the data the logarithmic curve will be a straight line. Whether the rate is constant or not, curves of this type are of value for comparing different rates. However, if the rate is not approximately constant considerable familiarity with logarithms is necessary if the comparative differences are to be correctly interpreted.

Exercises.

30. Plot the logarithmic curve of the data of Exercise 17.

31. Plot the logarithmic curves of the data of Exercises 15 and 16.

32. Plot the logarithmic curve of the Chicago Top Beef Cattle Prices.

Cumulative Curves. All the preceding curves show the respective values for each interval of the horizontal axis, as the production of wheat for each year since 1866 is shown by the curve of Exercise 21. Now if it is desired to construct a curve exhibiting at each year the total production of wheat since 1866, the amount of each year's production is added to that of all the preceding years and the resultant cumulative sums plotted. In this way a curve is obtained which starts at the lower left hand corner and proceeds in a diagonal direction across the diagram. It is called a cumulative curve. The values to be plotted will be, in the case of the cumulative curve of wheat production, 150,000,000, 360,000,000, 580,000,000, 840,000,000, etc.

Exercises.

33. Plot the cumulative curve of wheat production.

34. Plot the cumulative curve of corn production and compare with the curve of Exercise 33.

35. Of what significance is the slope of a cumulative curve?

* See "The Ratio Curve," Fisher. Quarterly Publications American Statistical Association, June, 1917.

CHAPTER II.

CURVE PLOTTING (Continued.)

Interpolation. The curves of the preceding chapter were drawn for the purpose of connecting the plotted points in order to assist the eye in following the course of the data across the diagram. However, other uses can be made of a statistical curve. At the beginning of Chapter I the data of monthly precipitation is given. What was the weekly precipitation? The Chicago Top Beef Cattle monthly prices are given under Exercise 7, Chapter I. What were the weekly prices during the period covered by that data? The population of the United States is given for ten-year intervals. What has been the population from year to year? These are essentially questions of interpolation, that is, of estimating values lying between the given values.

The method of obtaining intermediate values from the curve consists merely of measuring on the vertical scale the height of the curve at the required point. Thus with the population curve of Exercise 4, Chapter I, which is constructed from the decennial census reports, the population for the year 1906 is given by the height of the curve above the 1906 point on the horizontal scale.

Exercises.

1. Estimate the Top Beef Cattle Prices for each week in February 1916, from the monthly data of Exercise 7, Chapter I.

2. Estimate the values of inside lots for the fraction of a foot, say 67.5 feet, from the data of Exercise 16 of the preceding chapter.

3. What is, according to the data of Exercise 17 of Chapter I, the compound amount of $1 for 7.5 years at 10%?

This method of interpolating makes an estimated value depend on the two consecutive given values which inclose it.
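
The dependence on the two inclosing values may be made concrete with a short Python sketch for the 1906 population. It assumes that the curve between the 1900 and 1910 census points is a straight line, as in the broken-line method of Chapter I; a continuous curve drawn by hand would give a somewhat different reading. The function name interpolate is hypothetical.

```python
def interpolate(x0, y0, x1, y1, x):
    """Height at x of the straight line joining (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Census figures from Exercise 2, Chapter I.
pop_1900 = 75_994_575
pop_1910 = 91_872_266

# Assumption: the curve is straight between the two census years.
estimate_1906 = interpolate(1900, pop_1900, 1910, pop_1910, 1906)
print(round(estimate_1906))
```

Nothing but the 1900 and 1910 counts enters the estimate, so any irregularity of growth within the decade is invisible to it.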
But the increase in population during a decade may have occurred almost entirely during the last years of the period and yet the shape of the curve when drawn merely to connect the ten-year points may give no hint of this irregularity of increase. The temperature for one month may have no connection with that of the preceding month and hence the curve between the points, depending as it does on the two non-related values can hardly be expected to give the actual temperature for an intermediate week or day. If the price of wheat for the year 1905 is omitted can it be reliably estimated by drawing the curve from the years 1904 and 1906 and then interpolating for the missing year?

It must be apparent therefore that a curve which passes thru a series of more or less non-related points can be of little value in interpolation and that the problem of interpolation is essentially one of determining by some means or other the general course of the data and then estimating the intermediate values in conformity with this general trend. The values obtained in this way are the most probable values; accidental variations which bear no relation to the underlying tendencies can not be so estimated; in fact such variations can not be estimated or predicted by any means.

The Smoothing of a Curve. The curves of Chapter I, drawn as they are thru each point, preserve all the variations whether they are fundamentally essential or due merely to the presence of accidental influences. The curve of mean monthly temperatures, Exercise 6 of the preceding chapter, shows distinct seasonal variations in temperature: higher temperatures in summer and lower in winter. Along with these essentially significant changes are fluctuations apparently accidental as, in one year June is warm and in another relatively cool; sometimes January is warmer than February and sometimes the reverse is true.
To represent a general movement or trend the curve must be drawn without abrupt changes in direction and must sweep among the points rather than necessarily thru each point. Since such a smoothed curve, as it is called, depends on the general or collective characteristics of the data the drawing of it must be based on collective properties of the measurements. One pertinent general property has just been stated; namely, that the curve must be smooth, that is, not have abrupt changes in direction. This property expresses the statistical assumption that the significant variations are fairly uniform from value to value and not capricious or arbitrary. A second assumption, which is presently discussed, is that certain areas are relatively stable and unchanging.

Smoothing by Inspection. The smoothing of a curve may be based on a study of the data and made a matter of the skill and experience of the statistician without the assistance of definitely stated assumptions or properties. The curve is then said to be smoothed by inspection.

In smoothing a curve the first step is to study the data carefully. Without such an investigation into the probable sources and extent of the irregularities and fluctuations one cannot hope to know what irregularities to smooth out and what to leave in. A curve cannot be reliably smoothed by a statistician who does not know the data thoroly. On the basis of the information gained by this study a preliminary curve should then be drawn freehand among the points. By successive erasures and redrawings the finished curve can gradually be arrived at. Thus a curve showing the long time movements in the price of wheat will pass above some points and below others, and how much the curve should miss any point can not be determined without a knowledge of financial conditions, yields, etc.

The inspection method of smoothing a curve is often sufficiently accurate for all practical purposes, especially when done by a statistician of experience and especially when there is a considerable element of inaccuracy inherent in the data. Its disadvantage lies obviously in the fact that no two smoothings of the same curve will be exactly alike; the method is essentially tentative and personal.

In any event a rough preliminary draft of the curve should be made by inspection before proceeding to apply more refined methods.

Exercises.

4. Smooth the illustrative data at the beginning of Chapter I.

5. Smooth the data of the population of the United States as given in Exercise 2, Chapter I.

6. Smooth the data of annual rainfall of Exercise 5, Chapter I.

7. Smooth the data of Exercises 18, 19, 20 and 21 of Chapter I.

FIG. IV. The Smooth Curve of Monthly Precipitation at Columbus, 1916.

The Preservation of Areas. In the illustrative data at the beginning of Chapter I the precipitation of 4.9 inches in March is the total precipitation for the whole month. With a base of one unit, then, a rectangle of height 4.9 will have an area equal to the total precipitation. Likewise the rectangle on the July unit as a base will have an area equal to 0.7, which is the July precipitation. The prices of Exercise 7, Chapter I, can in a similar manner be represented by rectangles with heights equal to the respective prices and with unit bases. The population data of Exercise 2 of the same chapter may be represented by rectangles which are not adjacent and have nine rectangles omitted between successive census years. After the curve is smoothed each rectangle will be altered so as to have a curved top. The total area under the finished curve will then be the sum of the areas of the modified rectangles.
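
The rectangles just described may be written out in a short Python sketch for the monthly precipitation data given at the beginning of Chapter I. The variable names are chosen only for illustration; the unit base of one month is the one stated in the text.

```python
# Monthly precipitation at Columbus, 1916, in inches (January to December).
inches = [5.0, 1.5, 4.9, 2.3, 4.8, 3.5, 0.7, 3.2, 1.5, 1.8, 1.6, 3.6]

BASE = 1.0  # each month is taken as one unit on the horizontal scale

# With a unit base the area of each rectangle equals its height.
areas = [BASE * height for height in inches]

total_area = sum(areas)
print(f"March area: {areas[2]:.1f}")
print(f"July area:  {areas[6]:.1f}")
print(f"Total area: {total_area:.1f}")
```

The total, 34.4 inches, agrees with the annual precipitation entered for 1916 in Exercise 5 of Chapter I, so the sum of the rectangles has a direct meaning as the rainfall of the year.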
The First Rule of Preservation of Areas is that the curve should be so smoothed that the total area under the resulting curve is equal to the sum of the areas of the original rectangles. Since, for instance, the monthly precipitation is made up of the sum of the daily precipitations it is likewise reasonable to assume that the monthly sum is more stable than is the daily or weekly, and hence we have the Second Rule of the Preservation of Areas; namely, where possible, the areas of the individual rectangles are to remain unchanged. This can be done by adding to and subtracting from each rectangle an equal sum.

Within the requirement that the curve must be free from abrupt changes in direction the two preceding working rules furnish a fairly comprehensive basis for the smoothing of statistical data. In later chapters more detailed rules will be discussed and applied. However, for most data the present rules are sufficient.

As explained for the precipitation data a definite statistical meaning can usually be found for the rectangles. Even when a significance is with difficulty ascribed to the rectangles they should be drawn and the same rules applied to the smoothing as before. The method is in such cases justified wholly by its practical convenience.

In the illustrative plotting, at the beginning of Chapter I, of the data of monthly precipitation at Columbus for the year 1916, the vertical scale was laid off on a line thru the January point. In constructing the rectangles for smoothing, it is convenient to have the January and other perpendiculars at the middle of the respective intervals in order that there may be a half unit's space at the left of the beginning point. The zero point on the horizontal scale is then at the beginning of the first interval and the vertical distance for the first point is taken not on the vertical scale line but perpendicularly above the mid-point of the interval.
Whenever the curve is to be smoothed the scale is marked off in this way; ordinarily the method of Chapter I is employed where the curve is not to be smoothed.

The following diagram illustrates the application of the rectangle method of smoothing to the monthly precipitation data.

FIG. V. The Rectangle Method of Smoothing the Monthly Precipitation Data for Columbus in 1916.

Exercises.

8. Construct the smoothed curve of prices from the data of Exercise 7, Chapter I.
9. Do the same for the data of Exercise 5 and of Exercise 6, of the same chapter.
10. Do the same for the data of Exercises 18, 19, 20 and 21 of the same chapter.
11. Do the same for the data of Exercise 22 of the same chapter.
12. Can the rules of permanence of areas be applied effectively to the drawing of the curve for the data of Exercise 17 of the preceding chapter? Why? To the data of Exercises 15 and 16 of the same chapter?
13. In drawing the smooth curve of decennial census population it is advisable to alter the original data very slightly, if at all. Discuss. A common way of drawing this curve is to connect the ten year points by a series of straight lines and then round out the angles where the lines intersect. This assumes a uniform annual increase during the decade, an assumption which may or may not be true.
14. The statistical significance of the rectangles has been discussed for the precipitation data. Develop the corresponding explanation for the decennial census data.
15. Show that in the data of Exercises 15, 16 and 17 of the preceding chapter the rectangles are not significant.

The Adjusted Data; Interpolation. Since in general it is impossible to preserve exactly the area of each rectangle the process of smoothing will lead to values differing from those of the original data. Consequently, the data is said to be adjusted or graduated or smoothed by means of the curve.
In accordance with the reasoning at the beginning of this chapter the adjusted values are to be taken as giving a more significant idea of the true trend of the data than does the original data. It is evident that we have here the solution to the problem of interpolation. Therefore, the rule for interpolation is: to obtain the value at any point on the horizontal scale measure the corresponding ordinate of the smoothed curve, or measure the proper area under that curve. Thus the rainfall during the first week in June is obtained by measuring the area under the curve on the first one-fourth of the June base unit.

Test of a Graduation. The extent to which smoothing preserves the areas of the individual rectangles is often taken as a test of the appropriateness of the smoothing or graduation. The smoothed curve is said to fit the data and the term "goodness of fit" is used to denote the appropriateness of the methods used in the process of constructing the smooth curve. The goodness of fit is then measured by the extent to which the areas of the individual rectangles are preserved. In applying this test two columns of numbers are set down, in one the original values and in the other the adjusted values. The differences are then taken and studied. Other conditions being equal the smoothing with the smallest differences is the best, tho the judging of goodness of fit is largely a matter of experience.

Exercises.

16. Discuss the goodness of fit of each of the curves smoothed in the preceding exercises.
17. What is the best estimate on the basis of the data of page 25 of the Top Beef Cattle Prices for the first week in February, 1916? Note that in this data the rectangles have no special statistical significance.
18. From the data of Exercise 23 of the preceding chapter what is the best estimate of the bank clearings in the United States for the first half of the year 1908?
19. What is the significance of the rectangles in the case of the data of Exercises 14, 15, 16 of Chapter I?
20. In drawing the curves of Exercise 19 should the values be adjusted? Are these curves drawn by a process of smoothing?

Determining the General Trend of the Data. The characteristics of a movement over a number of years can be determined from the smoothed curve. Thus the general upward trend of prices during the years 1897 to 1917 is shown by the rise of the curve.

Perhaps the best way to picture a general movement in the data is to draw a straight line, or more than one straight line where there seems to be more than one distinct movement, to fit the data; that is, to smooth the data with a straight line. With data not conforming closely to a straight line there is likely to be some uncertainty in the exact location of the straight line or lines, but since the lines are but the pictures of the ideas of general increases or decreases the uncertainty is neither greater nor less than is the uncertainty in the ideas of the general movements themselves. The difficulty, in reality, is due to a lack of information regarding the data. The methods of Chapter X are of much service in this connection.

Exercises.

21. During the last 37 years has there been an appreciable increase or decrease in the precipitation at the Columbus Station?
22. During the same time has there been a decided upward or downward movement in temperatures at the same place?

Periodic Data. In smoothing and determining the general trend of data care must be taken that the data is not smoothed to conform to a straight line when there is an inherent periodicity in the material. The data of Exercises 23 and 24 of Chapter I exhibit significant tendencies for the values to be high for a few years and then consistently lower for a few years and then higher, and so on, thru more or less regular and uniform cycles.
In smoothing such data the ideal should be to determine a uniform cycle and then smooth the data into the curve made up of the determined cycles. The problem of smoothing such data is complicated by the fact that the curve in addition to being composed of a series of similar loops or arches also has a tendency to rise or fall. Thus the imports of the U. S. have increased on the whole during the last 50 years tho there have been increases and decreases following each other in fairly regular periods.

Exercises.

23. Smooth the data of Bank Clearings as given in Exercise 23 of the preceding chapter.
24. Smooth the data of Imports as given in Exercise 24 of the preceding chapter.
25. To what extent has there been a tendency for bank clearings and for imports to increase during the period covered by the given data?
26. Discuss the periods in the yield per acre of wheat in the U. S.
27. Do the same for the production of wheat.
28. Summarize the uses and advantages of the smooth curve as compared with the curve which passes exactly thru each point.

CHAPTER III.

FREQUENCY CURVES.

Definitions. The following data of the measures of heights of 750 students* may be taken for purpose of illustration. The measurements are classified to show the number of individuals for each inch of height.

Height.  Number.    Height.  Number.
  61        2         68       126
  62       10         69       109
  63       11         70        87
  64       38         71        75
  65       57         72        23
  66       93         73         9
  67      106         74         4
                     Total     750

TABLE I.

Height, the attribute or characteristic here under consideration, is in this table measured to the nearest inch, giving a group or class interval of one inch. A class interval or class is ordinarily designated by the value of its middle measurement, and the class limits are located on either side at a half unit's distance from this mid-value. All individuals, for instance, with height between 67.5 and 68.5 belong to class 68; here the limits are 67.5 and 68.5 and the class is designated by the number 68.
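The convention for class limits described above can be sketched in a few lines of Python. This is an illustration only: the function names class_limits and class_of are hypothetical, and the treatment of a measurement falling exactly on a limit (assigned here to the higher class) is an assumption, since the text does not say how such a case is to be scored.

```python
import math

def class_limits(mid_value, width=1.0):
    """Return the (lower, upper) limits of the class designated by mid_value."""
    half = width / 2
    return (mid_value - half, mid_value + half)

def class_of(height):
    """Return the one-inch class (its mid-value) containing a height in inches.

    Assumption: a height exactly on a limit, such as 67.5, is assigned
    to the higher class.
    """
    return math.floor(height + 0.5)

print(class_limits(68))  # limits of class 68
print(class_of(67.9))    # class of a student 67.9 inches tall
```

Under these assumptions a height of 67.9 inches is scored in class 68, whose limits are 67.5 and 68.5.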
Instead of using 61, 62, 63, etc., as class numbers, the classes may be simply numbered 1, 2, 3, etc., and these numbers used as class numbers. Again, the classes may be numbered in both ways from some point within the range, as 68. This would give class numbers as follows: -7, -6, -5, -4, -3, -2, -1, 0, +1, +2, etc.

The objects measured or enumerated are referred to as variates or simply as individuals.

* Records of physical measurements at Ohio State University Gymnasium, Freshman class, 1913.

The size or frequency of a class is the number of individuals within that class, and the total frequency is the sum of all the class frequencies. The table as a whole constitutes a frequency distribution of height, and shows the number of times each class occurs.

To illustrate the method of constructing a frequency distribution let us take the following data:*

Chicago Monthly Top Beef Cattle Prices.

Year   Jan.   Feb.   Mar.   Apr.    May   June   July   Aug.  Sept.   Oct.   Nov.   Dec.
1916  $9.85  $9.75 $10.05 $10.00 $10.90 $11.50 $11.30 $11.50 $11.50 $11.60 $12.40 $13.00
1915   9.70   9.50   9.15   8.90   9.65   9.95  10.40  10.50  10.50  10.60  10.55  11.60
1914   9.50   9.75   9.75   9.55   9.60   9.45  10.00  10.90  11.05  11.00  11.00  11.40
1913   9.50   9.25   9.30   9.25   9.10   9.20   9.20   9.25   9.50   9.75   9.85  10.25
1912   8.75   9.00   8.85   9.00   9.40   9.60   9.85  10.65  11.00  11.05  11.00  11.25
1911   7.10   7.05   7.35   7.10   6.50   6.75   7.35   8.20   8.25   9.00   9.25   9.35
1910   8.40   8.10   8.85   8.65   8.75   8.85   8.60   8.50   8.50   8.00   7.75   7.55
1909   7.50   7.15   7.40   7.15   7.30   7.50   7.65   8.00   8.50   9.10   9.25   9.50
1908   6.40   6.25   7.50   7.40   7.40   8.40   8.25   7.90   7.85   7.65   8.00   8.00
1907   7.30   7.25   6.90   6.75   6.50   7.10   7.50   7.60   7.35   7.45   7.25   6.35
1906   6.50   6.40   6.35   6.35   6.20   6.10   6.50   6.85   6.95   7.30   7.40   7.90
1905   6.35   6.45   6.35   7.00   6.85   6.35   6.25   6.50   6.50   6.40   6.75   7.00
1904   5.90   6.00   5.80   5.80   5.90   6.70   6.65   6.40   6.55   7.00   7.30   7.65
1903   6.85   6.15   5.75   5.80   5.65   5.15   5.65   6.10   6.15   6.00   5.85   6.00
1902   7.75   7.35   7.40   7.50   7.70   8.50   8.85   9.00   8.85   8.75   7.40   7.75
1901   6.15   6.00   6.25   6.00   6.10   6.55   6.40   6.40   6.60   6.90   7.25   8.00
1900   6.60   6.10   6.05   6.00   5.85   5.90   5.85   6.20   6.15   6.00   6.00   7.50
1899   6.30   6.25   5.90   5.85   5.75   5.75   6.00   6.65   6.90   7.00   7.15   8.25
1898   5.50   5.85   5.80   5.50   5.50   5.35   5.65   5.75   5.85   5.90   6.25   6.25
1897   5.50   5.40   5.65   5.50   5.45   5.30   5.25   5.50   6.00   5.40   6.00   5.65
1896   5.00   4.75   4.75   4.75   4.55   4.65   4.60   5.00   5.30   5.30   5.45   6.50
1895   5.80   5.80   6.60   6.40   5.25   6.00   6.00   6.00   6.00   5.60   5.00   5.50

* Yearbook, Chicago Live Stock World, 1917.

The width of the classes must be first determined. It would be possible to have a class for each quotation but it would be found highly inconvenient. The error introduced by the grouping of the measurements, the quotations in this case, is ordinarily of no practical significance. A general rule in determining the width of the classes, and hence the number of classes, is to make as wide classes as is practically feasible; the number of classes is perhaps most often from ten to twenty. In this case the width is taken as fifty cents and the limiting quotations of each class are included in the class. The data is examined and a score made for each occurrence of the class. Thus Class 1 with the range 4.50-4.99 appears Feb., Mar., Apr., May, June, July, 1896; as an occurrence is observed a mark or score is made, six in all. After the scoring is completed the frequency of each class, that is, the number of tallies or scores for each class, is noted and written in a column.

The frequency distribution just obtained shows the number of times each price-class has occurred during the last twenty-one years.

Exercises.

1.
Construct a frequency table from the Top Beef Cattle Prices using class intervals of twenty-five cents and compare with the distribution obtained when the class interval is fifty cents.

2. From the following table of Mean Monthly Temperatures at the Columbus Station construct a frequency table with a class width of five degrees.

Year  Jan.  Feb.  Mar.  Apr.  May.  June. July. Aug.  Sept. Oct.  Nov.  Dec.
1878  ..    ..    ..    ..    ..    ..    78.6  74.0  65.9  53.5  43.1  26.2
1879  25.6  28.9  41.6  50.3  64.1  71.4  78.4  71.2  61.4  62.4  43.8  37.2
1880  43.8  38.8  40.8  53.8  68.8  72.8  75.1  74.2  65.4  52.2  32.8  24.6
1881  24.2  29.2  36.8  47.6  67.8  69.7  78.6  75.8  74.6  60.5  44.4  40.6
1882  32.6  41.8  44.8  51.0  57.4  69.9  71.8  72.2  66.2  59.0  42.6  32.0
1883  27.1  34.3  35.3  51.0  60.3  70.7  74.1  70.2  64.0  55.6  44.7  34.8
1884  20.0  37.2  39.2  49.4  61.7  72.9  73.8  72.8  71.0  58.9  41.4  32.2
1885  22.9  19.4  33.1  50.0  61.2  69.0  76.6  71.1  64.0  61.4  40.9  32.4
1886  23.4  26.8  39.0  54.7  62.8  67.3  72.2  71.6  65.8  54.4  38.8  27.2
1887  26.8  36.4  37.3  50.8  67.6  71.2  79.6  70.8  64.0  51.4  41.7  32.8
1888  26.8  33.0  36.5  51.2  60.6  71.6  73.2  71.4  61.3  48.7  44.1  34.2
1889  34.2  26.4  42.2  61.8  61.4  67.7  74.1  70.2  63.8  49.0  41.2  44.6
1890  39.1  40.6  35.2  52.3  60.0  74.6  73.6  70.2  63.1  53.8  44.6  31.8
1891  33.0  36.8  34.8  52.9  57.6  72.4  70.0  71.0  69.4  52.8  40.3  40.0
1892  24.0  35.7  36.0  49.4  61.0  74.2  74.0  73.0  65.4  53.6  38.2  30.0
1893  18.8  30.8  39.5  51.6  59.4  71.4  76.4  71.8  66.7  55.1  40.0  33.0
1894  34.7  29.4  46.2  51.5  60.6  72.4  75.2  72.2  69.1  54.8  39.0  34.6
1895  24.1  21.0  36.7  53.2  62.8  74.9  73.8  75.6  71.2  48.2  42.4  34.9
1896  30.8  31.8  33.7  58.6  69.7  70.8  74.4  72.9  63.2  50.4  45.2  36.4
1897  26.4  34.0  43.1  50.6  57.9  69.8  77.2  71.1  68.8  59.8  42.7  33.7
1898  33.2  31.5  46.3  48.6  62.7  73.8  77.7  75.0  69.8  55.0  39.6  30.0
1899  29.4  22.8  38.3  55.2  65.4  73.4  76.2  75.8  65.7  59.2  45.3  31.1
1900  32.8  26.8  34.5  51.8  65.4  71.9  76.2  78.5  71.2  62.2  42.4  32.5
1901  30.4  23.5  40.6  48.5  61.1  73.4  79.9  74.6  66.6  55.7  38.6  28.7
1902  28.8  23.2  42.7  50.1  65.0  68.2  75.2  70.6  65.2  56.2  49.8  30.7
1903  28.3  31.8  47.8  51.2  65.8  66.0  74.5  73.0  67.6  55.4  38.4  24.6
1904  22.8  24.8  40.9  45.0  62.1  69.6  73.4  71.0  67.1  54.0  42.4  29.0
1905  24.0  22.0  44.4  50.4  62.3  70.8  74.9  73.3  67.1  53.7  40.6  34.2
1906  36.6  28.4  32.0  54.7  62.8  71.0  73.2  75.7  70.0  53.0  42.4  32.7
1907  33.5  27.2  46.8  43.0  55.7  67.2  73.9  71.0  66.7  50.0  40.2  34.6
1908  30.0  29.6  44.5  51.7  64.0  70.6  75.6  72.7  70.4  55.6  42.8  34.3
1909  32.8  36.2  38.1  50.1  59.8  71.8  72.0  73.4  64.1  49.9  49.6  25.8
1910  58.2  26.2  50.1  52.6  57.1  68.4  75.2  73.4  67.6  57.8  37.0  26.5
1911  33.5  35.4  38.3  49.0  68.6  72.8  75.7  73.9  68.8  54.1  38.2  37.4
1912  19.2  23.4  34.2  43.4  64.2  68.2  74.9  70.7  68.0  56.2  42.6  34.4
1913  36.8  27.2  40.8  50.4  61.7  71.8  76.5  75.4  65.4  54.4  45.8  35.5
1914  34.0  23.6  36.5  50.7  63.8  72.6  75.9  74.0  64.8  57.7  42.9  27.6
1915  27.8  36.0  34.0  56.3  58.2  68.0  73.0  68.2  68.0  56.7  44.4  31.0
1916  36.0  26.8  35.5  49.9  62.6  65.9  78.6  75.8  63.9  55.2  43.4  30.4

3. Construct the frequency table of the following data of monthly precipitation at the Columbus Station. Take one-half inch for the class width.

Year. 1878 Jan. Feb. Mar. Apr. May. June. July. 3 58 Aug. 5 00 Sept. 2 84 Oct. 3.17 Nov. 3.06 Dec.
3.88 1879 1880 1881 1882 1.66 4.49 2.25 4 69 1.43 1.70 4.44 5 94 3.77 2.42 4.01 4 76 0.92 5.08 2.04 4 87 2.09 3.21 2.00 9 59 2.68 3.30 4.02 6 01 3.67 4.86 5.33 2 62 4.64 6.95 2.09 3 14 2.33 1.80 1.54 2 91 0.23 2.35 8.64 2 44 3.52 4.54 5.35 2 05 4.29 3.98 5.23 2 23 1883 3.20 6.18 3.20 2.85 6 38 4 25 3 75 2.54 2.43 6 11 3.87 4.12 1884 1885 1886 2.25 3.75 4 36 4.95 2.39 1 26 3.59 0.53 3 90 2.11 4.61 3 57 3.79 5.83 7 67 2.59 5.08 2 69 2.16 3.28 4 17 0.70 5.90 2 44 3.46 2.84 3 61 1.66 3.11 1 13 0.99 3.08 4 18 2.77 1.85 3 41 1887 1888 2.35 3.73 6.48 1.30 2.56 3.79 3.44 1.53 2.97 3 89 2.82 1 62 1.45 5 81 2.21 4 34 1.35 91 0.30 3.77 2.45 3 26 1.87 1.11 1889 1890 1891 3.37 5.73 2.84 1.06 6.12 5.42 0.66 5.63 4.64 0.83 4.32 2.25 3.92 5.12 2.73 2.77 4.95 4.98 2.94 1.80 4.69 1.59 2.75 2.64 3.34 7.13 1.05 1.83 3.02 2.94 3.83 1.97 5.44 2.36 2.19 2.42 1892 2.21 3.35 2.23 2.67 3.58 4.96 3.31 5.12 1.47 0.84 2.20 1.60 1893 2 25 7 63 1 92 7 C8 4.81 2.89 1.27 1.65 1 14 3.33 2.16 1.97 1894 1895 1895 2.42 4.67 2 34 3.11 0.64 1 93 1.79 1.23 3 04 1.79 4.12 2 70 2.78 1.73 2 61 1.12 2.94 3 38 1.74 1.45 9 47 2.64 2.10 3 53 5.31 1.48 5 93 1.93 0.92 55 1.91 5.32 3 53 2.95 4.14 1 52 1897 1898 1.54 5.29 3.71 1.67 5.45 7.03 4.27 2.05 3.68 6.04 2.45 1.63 6.95 2.33 1.95 7.16 0.82 1.77 0.36 2.95 7.54 2.30 2.43 1.09 1899 1900 2.35 3.01 1.44 3.30 4.69 2.59 1.18 1.76 2.25 1.82 1.26 2.45 4.85 3.89 1.49 3.02 2.01 0.97 2.23 2.86 1.72 3.71 2.98 0.92 1901 1 50 88 1 82 2 21 4 24 6 31 1 23 1 71 2 10 0.33 59 3.61 1902 1 56 51 2 63 1 60 95 8 52 4 70 1 62 4 16 1 85 2 72 3 41 1903 1901 1905 2.11 2.80 1 25 4.44 8.12 1 57 4.13 4.93 5 87 2.47 2.49 3 15 2.18 4.01 4 38 3.07 3.86 2 78 2.05 2.48 2 27 0.67 3.18 5 45 1.46 0.83 3 36 1.84 0.97 5 45 2.01 0.18 1 64 1.71 3.63 1 87 1905.. . 1 98 1 08 4.59 1.16 2.47 1.44 5.27 6.15 1.59 2.07 2.57 3.33 1907. 
5 73 43 5 21 3 27 3.35 3.39 6.07 2.47 2.27 1.59 1.68 1.85 1903 1 40 3 66 6 03 2 75 4 04 2.13 3.74 2.34 42 1.20 0.84 1 59 1909 2 52 4 97 2 68 3 20 4 65 3 88 3 34 2.53 1 81 2 77 1.66 2.58 1910 5 11 5 05 28 2 52 4 10 2.93 2.40 42 3 66 5 22 79 2 31 1911 1912 1913 4.46 1.58 6 63 1.71 1.53 2.09 2.36 4.56 8 09 4.37 4.20 3.91 1.15 2.65 2.60 4.04 1.48 1.56 3.29 3.50 2.88 3.62 2.25 2.10 5.98 2.83 3.28 5.21 1.71 2.05 2.71 1.01 4.56 4.53 2.34 1.13 1914 1915 1916... 2.21 3.30 5.02 3.70 1.52 1.47 2.46 1.19 4.88 2.48 0.95 2.33 1.28 2.57 4.81 2.03 5.06 3.49 1.6! 6.85 O.G5 4.78 7.01 3.22 1.26 4.43 1.54 4.44 0.94 1.84 1.99 1.97 1.58 2.91 4.15 3.59

4. Study the frequency distributions of population with respect to age; Report of the Thirteenth Census, 1910, Chapter IV, Vol. 1, with special reference to the size of the various class intervals and note two general forms of stating the frequencies of the classes.
5. Examine the different forms of frequency distributions appearing in the report of the Medico-Actuarial Society's Investigations, Vols. I, II, III, IV; also in Biometrika, Agricultural Experiment Station Bulletins and in other accessible sources.
6. In which of the exercises of Chapter I is the data in the frequency distribution form?

Plotting a Frequency Distribution. The illustrative data at the beginning of this chapter is plotted by locating 14 equidistant points on a horizontal line, one for each height class from 61 inches to 74 inches inclusive. Then at the middle of each interval so obtained a vertical line is erected with a height proportional to the corresponding class frequency. In this way a point is obtained for each class.

As in Chapter II, a rectangle is constructed on each interval. It must be apparent that a rectangle in the case of the frequency distribution has in every case a significant statistical meaning; it is the frequency of the class.
Hence the sum of the areas of all the rectangles is the total frequency of the distribution.

Smoothing the Frequency Curve. With the rectangles drawn, the smoothing of a frequency distribution is in no wise different from the smoothing of the data discussed in the preceding chapter. However, for the frequency curve the two rules of the permanence of areas have a stronger justification because of the more definite significance of the areas under the curve.

With practice in the construction of statistical diagrams and curves the rectangles may be dispensed with and the curve drawn by inspection, especially when the data contains a large element of uncertainty. Also the broken line obtained by joining the ends of the ordinates, called the frequency polygon, may be smoothed by inspection into the required curve.

Exercises.

7. Smooth the illustrative data at the beginning of this chapter.
8. Smooth the frequency distribution of Chicago Top Beef Cattle Prices for 50-cent intervals.
9. Tabulate the same data to show the distribution for 25-cent classes.
10. Construct the smoothed frequency curve for the distribution of temperature and of precipitation at Columbus since 1879.
11. From data obtained from a financial paper construct the frequency distribution of the prices of preferred stocks for any one market day.
12. Do the same for a common stock.
13. Draw the smoothed curve of the following weight distribution of Ohio State University freshmen.

Weight Class   102  107  112  117  122  127  132  137  142  147  152
Frequency        8   13   20   48   76   93   93  110   93   49   56

Weight Class   157  162  167  172  177  182  187
Frequency       31   22   13   11    3    2    9

The weight classes are here of width five pounds and the middle value of each class is taken as the class number. Class 187 includes all persons with weight greater than 184.

14. Construct the smooth curve of the distribution of ages of graduates from the Columbus Public Schools.

Ages: 11 12 13 14 15 16 17 18
Numbers: 7 45 186 114 61 8

13. Construct the frequency curve of the preferred stock data of Exercise 11.
14. Do the same for the common stock data of Exercise 12.

Use of the Frequency Curve. The frequency curve does not give a chronological picture of the variations in the data. Instead it shows the number of times that each value occurs. The frequency curve of precipitation for a dryer climate is located to the left of that for a more moist climate because months with small precipitation occur more frequently in the dryer region. The frequency curve of higher prices lies further to the right than does that of lower prices, so that by constructing the frequency curves it can be readily discovered which series of prices tends to be higher.

Exercises.

15. Compare the Top Beef Cattle Prices of 1895 with those of 1915.
16. Compare the precipitation at Columbus with that of some other station.

Typical or Representative Data. Statistical data may be collected for the express purpose of exhibiting a chronological or other statement of the variations. This sort of data is usually based on the complete enumeration of a given set of objects, as the census of population to apportion the members of the House of Representatives, or measures of stature for military purposes.

In discussing an increase in prices it is impossible to quote all prices; recourse must be had to a carefully selected list of prices. The condition of trade in certain industries is taken as indicative of the condition of all business. In comparing the prices of beef and the prices of corn the real object of investigation is to find an underlying connection between the two series of values, a connection which will hold good in any particular year. In such a study the historical statistics of the two price variations are in reality used as representative, as typical, of the manner in which the two prices are related.
It is apparent that the frequency form of distribution is peculiarly adapted to typical data.

The Errors in Representative Data. The theory of enumerative statistics is simple in statement; the chief cares of the statistician are that all objects are counted and none counted more than once, and that an adequate and effective method of presentation is adopted. There are also complicated questions of the methods of collecting the data and of the limits of accuracy of the data but these are met with in data of either form.

Because it is practically impossible to secure homogeneous data; that is, data in which the values for all characteristics except those under consideration are the same for all variates, representative data must be examined for homogeneity. For instance, the persons whose heights are tabulated at the beginning of this chapter differ in age, early environment, physical condition, as well as in height, so that the given distribution is in reality a distribution of a complex of attributes instead of merely the one attribute, height. Unless the influence of these various factors is carefully studied, serious errors may result from attempting to apply to another distribution the conclusions drawn from this distribution.

It may be shown that, from absolutely homogeneous material, successive samples made up strictly at random, that is, without bias or prejudice, will most likely give materially differing distributions. The extent of such errors must be understood in reasoning from one distribution to another.

Hence in working with typical or representative data care must be taken regarding (1) the limits of accuracy of the data; (2) the homogeneity of the data; (3) the errors of random sampling.

CHAPTER IV.

The Arithmetic Mean. Let us add the January prices in the data of page 25, and then divide the sum by the number of items. The result is $8.33.
In this way a number, the arithmetic mean, is obtained. The characteristic arithmetic property of this number is that each of the given data values may be replaced by it without altering the total sum of all the values.

It is usual to speak of the arithmetic mean simply as the mean unless, in order to distinguish the arithmetic mean from some other mean, there is special need for the defining word "arithmetic."

Exercises.

1. Determine from the data of Exercise 1, Chapter I, the arithmetic mean of the monthly rainfall at Columbus, for March.
2. Determine from the data of Exercise 5, Chapter I, the arithmetic mean of the annual precipitation at Columbus.
3. Find from the data of page 25 the arithmetic mean of the 1895 Top Beef Cattle prices and compare with the 1915 mean.
4. On the assumption that the population of the United States increased uniformly from 1900 to 1910 find the value of the annual increase and then the estimated population for 1906.
5. Compute the arithmetic mean of the Monthly Top Beef Cattle prices for the years 1895 to 1916.
6. By first assigning each monthly price to the appropriate 50-cent class as on page 25 and computing the arithmetic mean of the prices when so altered determine the effect on the value of the arithmetic mean of substituting the class prices for the exact values. Use the class numbers in the computation and translate the result in terms of the proper interval.
7. In Exercise 5 there are 264 entries in the sum to be added. Show that much of the labor of the addition can be avoided by selecting the equal prices, then multiplying each by the number of times it occurs, and adding the resulting products to obtain the total sum of prices.

The results of Exercises 5, 6 and 7 suggest the computing of the mean from a frequency table in accordance with the following rule: multiply each deviation by its frequency, add the resulting products, and divide this total sum by the total frequency. The quotient is the value of the mean, measured from the origin from which the deviations are taken.
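The rule can be sketched in Python as follows, using the height distribution of Table I of Chapter III. The function name mean_from_frequency_table is hypothetical, and the origin of 60 inches is simply a convenient choice; any other origin leads to the same mean.

```python
# Height classes (inches) and class frequencies from Table I of Chapter III.
classes = [61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74]
frequencies = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]

def mean_from_frequency_table(classes, frequencies, origin):
    """Return (mean, d), where d is the distance of the mean from the origin."""
    total_frequency = sum(frequencies)
    # Multiply each deviation from the origin by its frequency and add.
    sum_of_products = sum((c - origin) * f for c, f in zip(classes, frequencies))
    d = sum_of_products / total_frequency
    return origin + d, d

mean_height, d = mean_from_frequency_table(classes, frequencies, origin=60)
print(d, mean_height)
```

Here the sum of the products is 5,925 and the total frequency is 750, so that d is 7.9 and the mean height is about 67.9 inches.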
Thus, from the frequency distribution of Top Beef Cattle Prices of Chapter III, obtained on page 26,

6, 13, 35, 43, 30, 30, 18, 12, 15, 16, 17, 3, 6, 7, 1,

the mean price is given by the expression

$$d = \frac{1 \times 6 + 2 \times 13 + 3 \times 35 + 4 \times 43 + 5 \times 30 + 6 \times 30 + 7 \times 18 + 8 \times 12 + 9 \times 15 + 10 \times 16 + 11 \times 17 + 12 \times 3 + 13 \times 6 + 14 \times 7 + 15 \times 1}{252} = 6.00,$$

where d is the distance of the computed mean from the origin.

The mean class is thus 6.00; that is, 6 of the 50-cent intervals. This gives 7.25, the mid-value of class 6, as the mean price.

Whenever the frequency table is available the method just described is usually the shortest method for computing the value of the mean. However, if the frequency distribution is not needed for any other purpose, and especially if an adding machine is at hand, the saving of time in the computation of the mean does not ordinarily justify the compilation of a frequency table merely for the one purpose of finding the mean.

The following is the computation for mean height from the data at the beginning of Chapter III. Let us take the origin at height 60. Then the computation scheme will be as follows:

Computation of the Mean.

Class.  Deviation.  Frequency.  Dev. times Freq.
  61        1            2              2
  62        2           10             20
  63        3           11             33
  64        4           38            152
  65        5           57            285
  66        6           93            558
  67        7          106            742
  68        8          126          1,008
  69        9          109            981
  70       10           87            870
  71       11           75            825
  72       12           23            276
  73       13            9            117
  74       14            4             56
Total                  750          5,925

$$d = \frac{5{,}925}{750} = 7.9$$

TABLE II.

Hence the mean height is 7.9 classes; that is, 7.9 inches from the origin, and is therefore equal to 67.9 inches.

Statistical Properties of the Arithmetic Mean. What is the statistical significance and interpretation of the arithmetic mean?
If a higher price were substituted for one of the January beef cattle prices the resulting arithmetic mean would be larger, but not so much larger as the individual price, because in the process of obtaining the mean the price increase is divided by the total number of prices. Hence a larger mean denotes that, as a whole, the values of the distribution are greater, and a smaller arithmetic mean is to be interpreted as indicating a relatively lower series of values. And since all increases and decreases are to be divided by the number of variates the changes in the value of the arithmetic mean are relatively smaller than those of the individual values. Thus a decrease of 50 cents must occur in each of the above prices in order to decrease the arithmetic mean by the same amount. A decrease of 50 cents in one-half the variates decreases the arithmetic mean by only 25 cents, and so on. That is, the arithmetic mean is relatively more stable than is an individual measurement. Thus if several groups of 750 students were measured for height and the frequency distribution tabulated and the means computed for each group it would be found that the means would differ but little while the frequency of any one class, 67 inches for instance, would vary considerably from distribution to distribution.

It is to be noted that a single increase of 50 cents in the price for one month has exactly the same effect on the value of the arithmetic mean as does a 10-cent increase in the prices of each of five months. But is this true statistically? Should the exceptionally high price be given so much weight? Should the person of exceptional height be emphasized so strongly in the group of persons whose height is measured? That is, the value of the mean may not always be significant because a part of its value may be due to the presence of unduly large variates.
Whether an item is unduly large can be determined only from a study of the data itself, for the mean conveys no information whatever as to the distribution of the variates; it tells only of their general size. That is, the statistical function of the arithmetic mean is essentially to measure the size or magnitude of the data as a whole.

Theorem. In any distribution the sum of the deviations from the mean is zero. That is, the sum of the positive deviations is equal to the sum of the negative deviations.

The distance of the mean from any origin is obtained by taking the sum of the deviations from that origin and dividing by the total frequency. When the origin is taken at the mean itself this distance is zero, and hence the sum of the deviations must be zero.

Weighted Arithmetic Mean. An apparent modification of the arithmetic mean is illustrated by the following: It is desired to obtain an index of food prices by taking the mean of the price quotations of 15 articles of food. It is decided, however, that one of the quotations should be given twice the weight of the other articles. This is done by multiplying this quotation by two and including the doubled quotation in the total sum, which is then divided by 16, the sum of the weights. The article is said to have a weight of two. The idea of weight introduces no new principles into the computation of the arithmetic mean.

Adjustment or Graduation Formulas. A class of adjustment formulas of wide and convenient adaptability to the smoothing of data is based on the arithmetic mean. A series of terms not differing greatly from each other may be smoothed by replacing each by the mean of the five terms, for instance, of which the given term is the middle term. The distribution obtained by the first adjustment may in turn be similarly smoothed, and indeed the process may be repeated at pleasure. In this way the various graduation formulas of this type are built up. Next to the graphic method this is the simplest method for the smoothing of observations.
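The five-term adjustment just described can be sketched as follows. This is an illustration under stated assumptions, not a prescribed procedure: the function name is hypothetical, and terms too near either end of the series, which lack a full set of neighbors, are simply left out. The series smoothed here is the column of student height frequencies.

```python
def moving_average(terms, span=5):
    """Replace each term by the mean of the `span` terms of which it is the middle.

    Terms within span // 2 places of either end have no full window and are
    omitted (an assumption; the text does not say how the ends are treated).
    """
    if span < 1 or span % 2 == 0:
        raise ValueError("span must be a positive odd number")
    half = span // 2
    return [sum(terms[i - half:i + half + 1]) / span
            for i in range(half, len(terms) - half)]

# Frequencies of the student heights, classes 61 to 74 inches.
freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]

once = moving_average(freqs, span=5)    # first adjustment
twice = moving_average(once, span=5)    # the process repeated on its own result
print(once[0], len(once), len(twice))
```

Each pass shortens the series by four terms, which bears on the question of how far the method can be trusted near the ends of the range.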
Extensive application of this method has been made in the graduating of mortality tables, and under the name of the method of the moving average it is often used in smoothing data in which the general trend is obscured by the presence of more or less regular fluctuations. In this case the number of classes grouped together should be determined by the lengths of the cycles of the fluctuations.* If the cycles are irregular in length the method of the moving average is not likely to yield satisfactory results.

* See King, Elements of Statistical Method, sec. 97. Also Quarterly Publications of the American Statistical Society, Dec., 1915, and March,

Exercises.
8. Smooth the data of Table I by taking the means of each successive five terms, then of seven and finally of nine.
9. Apply the method of the moving average to smooth the data of the Top Beef Cattle Prices. Is the method highly applicable to this data?
10. Discuss the reliability of this method for terms at the end of the range.
11. Apply the five term method to the distribution of Ex. 5, Chap. I.

The Geometric Mean. Let the price of a certain article for each year from 1910 to 1915 be expressed as a percent of that of the preceding year as follows (assuming 100 for the 1910 price):

100, 105, 118, 109, 102, 115.

The ratio of the 1915 price to the 1910 price is obtained by multiplying together the five percents, each written as a decimal, and is approximately 1.58. What uniform yearly percent will give the same ratio of the 1915 price to the 1910 price? Let (1 + r) be the constant multiplier or percent. Then we have

$$(1 + r)^5 = 1.05 \times 1.18 \times 1.09 \times 1.02 \times 1.15 = 1.58415$$

and

$$(1 + r) = \sqrt[5]{1.58415} = 1.096.$$

Each of the unequal increases in the series may therefore be replaced by the percent, 1.096, and still give the same product.
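The computation of the uniform multiplier can be reproduced in a few lines. The sketch below is illustrative only; the function name is an assumption, and the root is taken by way of logarithms, in the same spirit as hand computation with a table of logarithms.

```python
import math

def geometric_mean(factors):
    """The number which, put in place of every factor, leaves the product unchanged.

    Computed as the antilogarithm of the arithmetic mean of the logarithms.
    """
    return math.exp(sum(math.log(f) for f in factors) / len(factors))

# Yearly price relatives for 1911 to 1915, each as a decimal of the preceding year.
relatives = [1.05, 1.18, 1.09, 1.02, 1.15]

product = math.prod(relatives)      # ratio of the 1915 price to the 1910 price
rate = geometric_mean(relatives)    # the uniform multiplier (1 + r)

print(round(product, 5))   # 1.58415
print(round(rate, 3))      # 1.096
```

Raising the uniform multiplier to the fifth power returns the original product, which is the defining property of the geometric mean.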
The population of continental United States in 1910 was 91,972,266; in 1900, 75,994,575. On the assumption of a uniform rate of increase during the decade what should be the value of this uniform rate in percent? As above, we have

$$(1 + r)^{10} = \frac{91{,}972{,}266}{75{,}994{,}575} = 1.21025.$$

Hence

$$(1 + r) = \sqrt[10]{1.21025} = 1.019.$$

It may be noted that according to this method the population in 1906 was equal to $75{,}994{,}575 \times (1.019)^6$.

The problem in the case of the arithmetic mean is to find a uniform number which, when substituted for each of the variates, leaves the total sum unchanged. In problems similar to that just preceding it is a matter of finding a number which, when substituted for each of the given numbers, leaves the product of all the numbers unchanged; such a number is called the geometric mean.

Exercises.
12. Compute the geometric mean of the following numbers: 2, 4, 8.
13. Compare from exercise 4, the 1906 population on the assumption of a uniform annual increase with that obtained from the assumption of a uniform annual rate.

For any but the simplest problems the computation of the geometric mean cannot be accomplished without the use of logarithms. The following computation of the geometric mean of student heights from the data of page 24 illustrates the process.

$$\text{Geometric mean height} = \left(61^{2}\cdot 62^{10}\cdot 63^{11}\cdot 64^{38}\cdot 65^{57}\cdot 66^{93}\cdot 67^{106}\cdot 68^{126}\cdot 69^{109}\cdot 70^{87}\cdot 71^{75}\cdot 72^{23}\cdot 73^{9}\cdot 74^{4}\right)^{1/750}$$

and

$$750 \log(\text{geo. mean}) = 2\log 61 + 10\log 62 + 11\log 63 + 38\log 64 + 57\log 65 + 93\log 66 + 106\log 67 + 126\log 68 + 109\log 69 + 87\log 70 + 75\log 71 + 23\log 72 + 9\log 73 + 4\log 74.$$

Hence

$$\log(\text{geo. mean}) = \frac{1{,}373.70355}{750} = 1.83160$$

and geo. mean height = 67.86.

Exercises.
14. Compute the geometric mean of the Monthly Top Beef Cattle Prices.
15. Compute the geometric mean for the March precipitation at Columbus for the years since 1878.

Properties of the geometric mean.
Unlike the arithmetic mean the geometric mean is most powerfully affected by the smaller deviations, because a small factor in a product has a proportionately greater influence on the result of the multiplication than does a larger factor.

Each property of the arithmetic mean has a corresponding property for the geometric mean because the logarithm of the geometric mean is the arithmetic mean of the logarithms of the deviations. From this logarithmic correspondence all the properties of the geometric mean can be derived* from those of the arithmetic mean. It is apparent, for instance, that the geometric mean applies to a series of deviations to be multiplied together in a way exactly parallel to that in which the arithmetic mean applies to a series of terms to be added. Other parallels are: a chain of relative prices and a series of price increases; interpolation on the assumption of a uniform rate and on that of a uniform increase; compound interest and simple interest.

The Median. Let the years 1879 to 1915 inclusive be arranged in order of the March precipitation beginning with the lowest. We then have, with the data† measured to hundredths of an inch (reading down each column):

Year   Inches     Year   Inches     Year   Inches
1910    0.28      1887    2.56      1906    4.59
1885    0.53      1900    2.59      1891    4.64
1889    0.66      1902    2.63      1899    4.69
1915    1.19      1909    2.68      1882    4.76
1895    1.23      1896    3.04      1904    4.93
1894    1.79      1883    3.20      1907    5.21
1901    1.82      1884    3.59      1897    5.45
1905    1.87      1879    3.77      1890    5.63
1893    1.92      1888    3.79      1908    6.03
1892    2.23      1886    3.90      1898    7.03
1911    2.36      1881    4.01      1913    8.09
1880    2.42      1903    4.13
1914    2.46      1912    4.56

* See Zizek, "Statistical Averages," Chapter III. Also Jevons, "On the Variation of Prices and the Value of the Currency since 1782," Jour. Roy. Stat. Soc., Vol. XXVIII, 1865. Galton, "The Geometric Mean in Vital and Social Statistics," Proc. Roy. Soc., Vol. XXIX, 1897, p. 305.
McAlister, "The Law of the Geometrical Mean," the same, p. 367. Yule, "An Introduction to Statistics," p. 123.
† U. S. Weather Bureau Report, Columbus Station, 1917.

The middle year, 1883, in this ordered arrangement is called the median year with respect to March precipitation; the median precipitation, 3.20 inches, is that of the median year.

In general the median individual is defined as the individual so located that there are as many individuals with a greater value of the characteristic as with a less value; and the middle value of the measured characteristic is spoken of as the median value of the characteristic. If the number of variates is even the median is assumed to lie between the two middlemost variates.

It is obvious that the above median precipitation year might have been obtained by a simple process of counting and inspection of the data without the somewhat laborious process of arranging the variates in order.

Exercises.
16. From the data of Exercise 2, Chapter III, determine the median Columbus monthly temperature, and the median year in respect to temperature.
17. From the price data of page 25 determine the median top beef cattle price.
18. From the data of Exercise 19, Chapter I, determine the median price for wheat.

When the data is in the form of a frequency distribution the computation of the position of the median is much facilitated. All that is necessary then is to start from one extremity of the distribution and include successive classes until half the total frequency is obtained. The only point of difficulty in this case arises when the median is located within a class. Then it is necessary to interpolate within the median class for the more exact position of the median.

To illustrate the method of interpolation let us find the median student height from the data at the beginning of Chapter III. Half of the number of variates is 375.
Counting from the lower extremity we find, up to and including class 67, a frequency of 317, so that it is necessary to take 58 individuals from class 68. Hence we may assume that the position of the median will be 58/126 of a unit from the left boundary of class 68. Since this boundary is at 67.5 the median is located at 67.5 + 58/126 = 67.96 inches.

Geometrically, the median deviation locates the ordinate which divides the area under the frequency curve into two equal parts.

The median can be found directly from the cumulative curve by drawing a horizontal line thru the point on the vertical scale corresponding to half the total frequency. The abscissa of the point of crossing of this horizontal line and the curve is the median deviation.

Exercises.
19. By drawing the cumulative curve locate the median student height.
20. From the frequency distribution of top beef cattle prices of page 25 determine the median price by using the cumulative curve.
21. What is the median point of population as determined by the Bureau of the Census (see pp. 50-52, Vol. I, Report of the 13th Census)?
22. Distinguish the median point of population from the center of population.

Quartiles. Each half of the distribution, one on either side of the median, may be divided into two equal parts. These two points of division are the First and Third Quartiles. The two quartiles and the median thus divide the variates into four classes of equal frequencies.

In data having predominantly large frequencies near the center of the distribution the quartiles are relatively close to the median, and in widely scattered data the quartiles are relatively far from the median. This property of the quartiles is developed and applied in the next chapter.

Deciles. The decile variates are the variates which separate the frequency into ten equal classes. The median is of course the fifth decile but the quartiles are not deciles.
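The interpolation used for the median serves equally for the quartiles and the deciles; only the fraction of the total frequency to be counted off changes. A minimal sketch follows, applied to the student heights. The function name and its arguments are assumptions made for illustration, and the classes are assumed to be of equal width and given by their mid-values.

```python
def quantile_from_table(mid_values, freqs, fraction, width=1.0):
    """Locate the point below which `fraction` of the total frequency lies.

    Classes are counted off from the lower extremity; within the class where
    the required count is reached, the position is found by simple proportion.
    """
    target = fraction * sum(freqs)
    below = 0
    for mid, f in zip(mid_values, freqs):
        if f > 0 and below + f >= target:
            left_boundary = mid - width / 2
            return left_boundary + width * (target - below) / f
        below += f
    raise ValueError("fraction must lie between 0 and 1")

mid_values = list(range(61, 75))   # student height classes, in inches
freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]

median = quantile_from_table(mid_values, freqs, 0.50)
first_quartile = quantile_from_table(mid_values, freqs, 0.25)
third_quartile = quantile_from_table(mid_values, freqs, 0.75)
deciles = [quantile_from_table(mid_values, freqs, k / 10) for k in range(1, 10)]

print(round(median, 2))   # 67.96
```

The fifth entry of `deciles` coincides with the median, while neither quartile appears among the deciles.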
The chief use of the deciles, like that of the quartiles, is in determining the shape of the distribution.

Exercises.
23. Determine the quartile precipitations from the data of Exercise 5, Chapter I.
24. Determine the decile precipitations from the data of Exercise 3, Chapter II.
25. Determine the quartile and the decile temperatures from the data of Exercise 2, Chapter III.
26. Determine the quartile prices from the top beef cattle prices of page 25.
27. Determine the quartile top beef cattle prices from the data in the form of a frequency distribution of the data of page 25. In this problem the quartile prices must be obtained by a process of interpolation similar to that described for the median.

Statistical Properties of the Median. The value of the median ordinate depends not on the actual values of the variates but solely on the relative values. The data need be given with only enough exactness to permit the arrangement of the variates in order with respect to the attribute considered. Moreover, it is only the arrangement near the median value that must be carefully attended to; consequently the median cannot give detailed information of the variates at the extremities of the ranges.

There is apparently no a priori reason why the value of the median should not show considerable variation from sample to sample taken from the same material, but in practice it is found that the median shows as high a degree of stability as does the arithmetic mean, if not a higher. Thus if a second group of 750 students were measured as to height and the median computed it would most likely be found to differ only slightly from that of the group already discussed. This slowness of change in the median means that the median is not greatly affected by the presence of accidental and irrelevant influences.
That is, differences in the value of the median are not likely to be merely accidental and hence the median measures significant properties of the material. For instance, a distribution of wages showing a higher median wage must be significantly a group of higher wages.

The properties just discussed, together with the fact that the median can be located by the simple process of counting, render the median a highly important average in practical statistical work.

The Probable Deviation. The median variate divides the data into two classes of equal frequencies. Hence it is an even chance that an individual selected at random will fall into a designated one of the two classes. If the median height of freshmen students is 68 inches it is an even bet that a student concerning whose height nothing is known has a height less than 68 inches. Likewise it is an even bet that a student selected at random will have a height between the first and third quartiles. The range from the median to the third or first quartile, one-half of the range within which the chances are even for an individual measurement to lie, is called the probable deviation.*

Exercises.
28. Determine the probable deviation for top beef cattle prices.
29. Determine the probable deviation for monthly precipitation at Columbus; for monthly temperatures at the same station.
30. Show that the probable deviation is necessarily connected with the frequency distribution and not with a chronological distribution.

The Mode. Notice that, in the frequency distribution of student heights, class 68 has the greatest frequency and that the high point on the frequency curve is within the same class. The class of greatest frequency is called the modal class and the deviation with the highest ordinate the modal deviation.
A mode is thus defined as a class or deviation of greatest frequency; more accurately, it is the class or deviation of greater frequency than that of either the class immediately greater or immediately less. This second definition allows for distributions having more than one mode.

Exercises.
31. From the smoothed frequency curve of the data of page 27 determine the modal monthly precipitation.
32. Determine the modal March temperature for Columbus.

It is possible to locate the mode within a class by a process of interpolation similar to that described in the determination of the median, but by far the easiest method is to construct the smooth frequency curve and determine the abscissa or deviation of the greatest ordinate.

When the data seems to have more than one mode care must be exercised in deciding whether to smooth out the apparent modes. In the frequency distribution of monthly temperatures it is evident that there are summer and winter modal temperatures. The telephone-calls data of Exercise 33 below shows more than one mode. On the other hand the data of age distribution reported by the United States Census Bureau shows a tendency for the frequencies at the even ages to be larger than at the odd ages. This latter tendency is partly due to the fact that persons who are uncertain as to their exact age seem to show a preference for an even number. These apparent modes should be smoothed out.

* Certain qualifications of this definition are discussed in Chapter V.

Data with essentially one mode is said to be unimodal; with more than one mode, multimodal.

Exercises.
33. Smooth the following data of the telephone calls for one day at a business exchange* and locate the modes.

Time     6-7    7-8    8-9   9-10  10-11  11-12   12-1    1-2    2-3
Calls   1595   3430   6389   6904   7282   7358   6361   5659   6186

Time     3-4    4-5    5-6    6-7    7-8    8-9   9-10  10-11  11-12
Calls   6597   6510   6093   4508   4210   2289   1197    916    314

Time   11-12
Calls     12

34.
Do the same for the following residence calls.**

Time     6-7    7-8    8-9   9-10  10-11  11-12   12-1    1-2    2-3
Calls   1256   3796   6604   4098   4240   3816   5852   4421   3136

Time     3-4    4-5    5-6    6-7    7-8    8-9   9-10  10-11  11-12
Calls   4344   3267   4541   4778   4039   2088   1176    655    187

35. Determine the modal classes for the top beef cattle prices.

* By permission of Central Union Telephone Company, Columbus. Main Exchange.
** Same, North Exchange.

Statistical Properties of the Mode. Because the necessary modifications are easily made for multimodal data the properties of the mode are here discussed only for a unimodal distribution.

Since the modal class or deviation is that of greatest frequency; that is, since more variates belong to that class than to any other, the mode is the most typical of all the variates of a distribution. If any one variate is to be selected as descriptive of the data the modal variate should be that variate.

The mode is accordingly said to define the type of the distribution. The significance of the mode as a type depends, of course, on the relative preponderance of its frequency. Thus the frequency of height 68 in the case of the student distribution of page 24 is 126, and the combined frequency of the classes near the modal class is a large percent of the total frequency. In the beef cattle prices of page 25 the modal class has a frequency of 43 and there is not as rapid a falling off in the frequency on either side of this class as is shown by the height data. Hence in the price data the mode does not have as great significance as it does in the height data. Data showing a strong tendency to concentrate about the mode is said to be highly stable or true to type. Measures of trueness to type are discussed in the following chapter.

The position of the mode depends only on the values of a few variates, so that the mode like the median gives little information of the extremes of the range.
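The second definition of the mode, a class of greater frequency than either of its neighbors, lends itself to a simple search thru the frequency table. The sketch below is an illustration only: it finds modal classes and does not locate the mode within a class; the function name is hypothetical; an end class is compared with its single neighbor; and the two-humped series is invented for the purpose.

```python
def modal_classes(classes, freqs):
    """Return every class whose frequency exceeds that of both adjacent classes.

    An end class has only one neighbor and is compared with that one alone
    (an assumption not settled by the text).
    """
    modes = []
    for i, f in enumerate(freqs):
        left = freqs[i - 1] if i > 0 else -1
        right = freqs[i + 1] if i < len(freqs) - 1 else -1
        if f > left and f > right:
            modes.append(classes[i])
    return modes

# Student heights: a unimodal distribution.
height_classes = list(range(61, 75))
height_freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]
print(modal_classes(height_classes, height_freqs))   # [68]

# A hypothetical two-humped series (not from the book).
print(modal_classes([1, 2, 3, 4, 5, 6, 7], [3, 9, 5, 4, 8, 12, 6]))   # [2, 6]
```

Applied to unsmoothed data such a search reports every small accidental peak as a mode, which is why the question of smoothing out apparent modes must be settled first.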
The mode cannot be accurately determined by a simple process of arithmetic as can the median and the mean.

The mode being the predominating value, the type, the fashion, it is what is ordinarily in the popular mind when an average is spoken of. The statement that the average person spends one-third of his income in rent is most likely to mean that more persons spend about that per cent than any other per cent.

Exercises.
36. Determine the modal class for each frequency distribution of Chapter III.
37. Show that the concept of mode does not apply to a curve of the historical type.

CHAPTER V.

THE FORM OF A DISTRIBUTION.

Dispersion. It is stated in the preceding chapter that the significance of the mode as a representative of the data depends on the extent to which the data conforms to the mode as a type. That is, if the sum of the frequencies near the mode is a relatively large per cent of the total frequency the modal deviation is highly typical and the data is not highly variable. The word variable is used because, if in the data a certain type does not predominate, different samples will have a tendency to show widely differing distributions. If, to illustrate, the modal frequency of a second distribution of the heights of 750 students is only 95, with a similar reduction in the other larger frequencies, this second distribution is not so true to the type expressed by the mode as is the first distribution.

To repeat, a distribution with small frequencies at the ends of the ranges and with the frequencies concentrated at a point is said to be true to type, to be highly stable.

Let us investigate various methods of measuring the extent to which the data is scattered or dispersed about the class of concentration.

Measures of Dispersion. Because the breadth of the range depends on the usually uncertain data at the extremes it does not furnish a reliable measure of the extent to which the data is spread out.
As given on page 24 the range of student heights is 14 inches; the inclusion of a single student of height 58 inches would increase the range by more than twenty percent.

We have seen that in theory the dispersion should be measured from the mode, but in practical statistical work the mean, median and mode differ so little in position that it is ordinarily permissible to measure the dispersion from the mean.

The sum of the deviations about the mean is useless as a measure of dispersion because, as was proved on page 35, this sum is zero regardless of the spread or dispersion of the distribution.

Mean Deviation. Since the object in measuring dispersion is to determine the divergences of the variates from an average it is the amount of a divergence that counts and not its direction. Hence a logical measure of dispersion is obtained by adding the divergences, all counted positive, and then dividing the sum by the total frequency. This gives the mean deviation.

The form for the computation of the mean deviation is the same as for the arithmetic mean except that all deviations are measured from the mean, median or mode, whichever is chosen for the origin, and all negative signs are disregarded.

Exercises.
1. Compute the mean deviation from the arithmetic mean of the Student Height Data of page 24.

Referring to the computation for the arithmetic mean on page 33, let us add a column obtained by taking the difference between the mean and each deviation, and then multiply these differences by the respective frequencies and add the resulting products. This sum is then divided by the total frequency in order to obtain the mean deviation. We thus have:

Computation of the Mean Deviation.

Class No.   Diff.   Freq.     Prod.
    1        6.9       2       13.8
    2        5.9      10       59.0
    3        4.9      11       53.9
    4        3.9      38      148.2
    5        2.9      57      165.3
    6        1.9      93      176.7
    7        0.9     106       95.4
    8        0.1     126       12.6
    9        1.1     109      119.9
   10        2.1      87      182.7
   11        3.1      75      232.5
   12        4.1      23       94.3
   13        5.1       9       45.9
   14        6.1       4       24.4
                     ---    -------
                     750    1,424.6

Mean deviation = 1,424.6 / 750 = 1.9 classes.
Since each class interval is one inch the mean deviation is 1.9 inches.

TABLE III.

2. Compute the mean deviation about the arithmetic mean of the price data of page 25.
3. Compute the mean deviation about the median of the price data of page 25 and compare the result with that of Exercise 2.
4. Compute the mean deviation about the arithmetic mean of the precipitation data of Exercise 5, Chapter I, and of the temperature data of Exercise 6 of the same chapter.
5. From the frequency tables of Exercises 2 and 3 of Chapter III compute the mean deviation of monthly precipitation and of monthly temperature.

For purposes of comparing the stability of different distributions it is desirable to divide the mean deviation by the mean or median, whichever is used. When this is done the mean deviation is expressed as a fraction of the base average. For instance, it seems reasonable to say that a mean deviation of 0.3 with an arithmetic mean of 20 has the same significance as a mean deviation of 0.9 based on an arithmetic mean of 60.

Exercises.
5. Compare the dispersions in Exercises 1, 2, 3, 4.

Because, as is presently proved, the mean deviation is least when taken about the median it is theoretically best to compute the mean deviation about that average. When so done there is a certain degree of standardization which is not attained with any other average as a base, but the point is not of great practical importance unless the median and the arithmetic mean differ markedly.

Proof that the mean deviation is smallest when taken about the median. Let P be a point on the line S-T between the points A and B.
The sum of the deviations of P from A and B is, without regard to the sign of the negative deviation PA, PB + PA, and this sum is equal to AB. If P should lie without the segment AB the sum of the two deviations would be greater than AB. Likewise the sum of the distances of P from any other two points C and D is least when P lies between them. Hence the total sum of deviations of P from any number of points is least when there are as many points on one side of P as on the other; that is, when P is the median of the points.

S ----A----C----E----P----B----D----F---- T

Exercises.
6. According to the measure supplied by the mean deviation which is the more variable, the monthly mean temperature or the monthly mean precipitation at Columbus?
7. From the data of heights on page 24 and the data of weights of Exercise 13, Chapter III, determine which is the more variable, student height or student weight.

Statistical Properties of the Mean Deviation. The mean deviation as a measure of dispersion has all the properties of a mean: it takes all the variates into account; it takes each variate according to its size and consequently may give more prominence to extreme variates than their statistical importance may warrant; it is computed by a simple process of arithmetic. Because in forming it only the numerical values of the deviations are used and all distinctions between positive and negative deviations are disregarded, the mean deviation is not well adapted to certain statistical purposes for which the standard deviation, to be next discussed, is preeminently fitted.

Altogether the mean deviation is an index of dispersion of practical importance and should ordinarily be used either alone or in connection with other measures.

The Standard Deviation. The mathematically simplest device for eliminating negative signs is by squaring the terms.
Hence if the difference between each deviation and the mean be squared, the squares added, and the resulting sum divided by the total frequency, the mean squared deviation thus obtained is a measure of dispersion which is arithmetically more convenient than is the mean deviation.

The computation of the mean squared deviation differs from the computation of the mean deviation, which is illustrated under Exercise 1, only in that the deviation differences are squared before multiplication by the frequencies. It is of course possible to compute directly from the data without using the frequency table, but only a slight error is introduced by the combining of the actual values into reasonably narrow classes, and much labor is ordinarily saved because only one multiplication is then required for each class instead of for each individual variate, as is necessary if the frequency distribution is not used.

Exercises.
6. Determine the mean squared deviation about the arithmetic mean of the data of Student Heights.
7. Do the same for the Prices of Top Beef Cattle.
8. Do the same for Monthly Precipitation at the Columbus Station.
9. Do the same for Monthly Temperatures at the Columbus Station.

The above method of computing the mean squared deviation involves fractional differences in the deviations. By the following modification fractions can be avoided.

Short Rule for the Mean Squared Deviation. Select an integral deviation near the actual arithmetic mean and find the difference between each deviation and this selected deviation. Square each of the differences so obtained, multiply by the corresponding frequency, add, and divide by the total frequency. The result is the mean squared deviation from the selected value.
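The short rule may be sketched for the student heights as follows, assuming 8 as the selected integral deviation (the mean being 7.9 classes from the origin at 60 inches). The variable and function names are hypothetical. The last step subtracts the square of the distance between the mean and the selected value, so that the result may be compared with the direct computation about the mean.

```python
# Deviations of the height classes from the origin at 60 inches, with frequencies.
deviations = list(range(1, 15))
freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]

n = sum(freqs)                                              # total frequency, 750
mean = sum(d * f for d, f in zip(deviations, freqs)) / n    # 5925 / 750 = 7.9

def mean_squared_deviation(about):
    """Squared differences from `about`, weighted by frequency, divided by n."""
    return sum(f * (d - about) ** 2 for d, f in zip(deviations, freqs)) / n

selected = 8                                        # integral deviation near the mean
about_selected = mean_squared_deviation(selected)   # whole-number differences only
short_rule = about_selected - (mean - selected) ** 2

direct = mean_squared_deviation(mean)               # fractional differences throughout

print(round(short_rule, 4), round(direct, 4))
```

The two results agree, although the short rule never squares a fractional difference until the single correction at the end.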
To obtain the mean squared deviation from the arithmetic mean all that is necessary is to subtract from the value just computed the square of the difference between the true arithmetic mean and the selected integral value.

If the mean squared deviation about the actual arithmetic mean is denoted by $\sigma^2$ ($\sigma$ being the Greek letter sigma), and the mean squared deviation about any other point by the same symbol written with a prime, $\sigma'^2$, we have, on recalling that the letter d is used to denote the deviation of the arithmetic mean from the origin, the following formula:

$$\sigma^2 = \sigma'^2 - d^2.$$

To prove this formula let the deviations from the original origin be denoted by X and the deviations from the arithmetic mean by x, and let the distance of the mean from the original origin be denoted by d. Then X = x + d for each individual in the distribution and x = X - d. The mean squared deviation about the mean, which is the square of the standard deviation, is obtained by squaring each x, adding, and dividing by the total frequency N. Performing these operations, and recalling that $\sum X = Nd$, we have

$$N\sigma^2 = \sum x^2 = \sum (X - d)^2 = \sum X^2 - 2d\sum X + Nd^2 = N\sigma'^2 - 2Nd^2 + Nd^2 = N\sigma'^2 - Nd^2,$$

and, on dividing by N, $\sigma^2 = \sigma'^2 - d^2$.

Which is the more variable according to the standard deviation, Student Heights or Weights?

Statistical Significance of the Probable Deviation. The statistical application of the probable deviation may be illustrated by the following questions: The mean height of a group of students is 67.9 inches with a probable deviation of 1.78 inches. The height of a student taken at random from a second group is 72 inches. What is to be concluded? That the two groups are taken from essentially the same populations or that they are taken from distinctly different populations? That is, how many times may a deviation exceed the probable deviation and still be assumed to come from the same material?

It must be apparent that this is a fundamental question in statistical analysis. Further discussion of it is deferred to the following chapter.

The Deciles as Measures of Dispersion.
The position of the deciles shows the spread of the variates in the distribution. If the deciles near the middle of the distribution are close together and the deciles near the beginning and the end of the ranges are far apart the distribution is highly variable and not true to type. Because there are nine decile positions to observe in a distribution the decile is not so simple a measure of dispersion as is the quartile or standard deviation, tho this very fact of greater detail may in some cases be of advantage.

Exercises.
21. By the use of the deciles compare the variability of monthly precipitation at the Columbus Station with that of monthly temperatures at the same station.

Symmetrical and Asymmetrical Distributions. The curve of student heights is essentially of the same shape to the right of the highest point as it is to the left. It is a symmetrical curve. (Fig. VI.) Statistically the fact of symmetry means in this case that there is no tendency for the students to be either tall or short; that there is no selection between the tall and the short; that the chances for a tall person to belong to the student group are equally as good as those of a short person; that there is absolutely no connection between being a member of this student group and being tall or being short.

FIG. VI. A Symmetrical Curve.

On the other hand the curve of height of the members of a police force would have a longer range to the right than to the left because extremely short persons are excluded. The curve in this case is said to be asymmetrical. Asymmetry in a curve denotes the presence of selection in the data; of a dependence; of an expressed preference for certain values of the attribute.

FIG. VII. An Asymmetrical or Skew Curve.

Exercises.
28. Examine each frequency curve of Chapter III for symmetry and discuss the significance of each case of asymmetry.

The Position of the Averages and Asymmetry.
In the symmetrical curve the mean, median and mode coincide. The cutting off of the range to the left tends to move the mean to the right because the longer deviations are to the right, and it has been seen that the mean is most affected by the longer or extreme deviations. This places the median at the left of the mean. The mode will tend to be moved to the left of the median because both of the effect of the moving of the mean to the right and of the shortening of the left range with a consequent heaping up of the frequencies within the left half. The result is that the three averages are then in the order mode, median, mean. It has been verified experimentally that for moderately asymmetrical distributions the distance of the median from the mean is about one-third the distance of the mean from the mode.

Skewness. An asymmetrical curve is said to be skew. Skewness is positive when the longer range is to the right and negative when the longer range is to the left.

Measures of Skewness. Since the mode and mean are separated to an extent depending on the degree of skewness present, a logical measure of skewness is the difference between the mean and the mode. Because a large difference between the positions of the mean and the mode in widely spread-out data may not be so significant as a smaller difference in highly concentrated data it is advisable to divide this difference by the standard deviation. Hence we have,

$$\text{Skewness} = \frac{\text{Mean} - \text{Mode}}{\sigma}.$$

Exercises.

29. Compute the skewness of the following data of incomes:

Estimated Distribution of Income among the Single Women of Continental United States in 1890. (King, Wealth and Income, p. 224)

Class            No. in Thousands
$0-$200                 10
$200-$300               70
$300-$400              560
$400-$500              530
$500-$600              280
$600-$700              150
$700-$800              120
$800-$900               37
$900-$1000              22
$1100-$1200             12
$1200-$1300              8
$1300-$1400              5

30.
Show that the above formula for skewness correctly indicates the sign of the skewness.

A Second Measure of Skewness is obtained as follows: Any measure of skewness must take into account the distinction between positive and negative deviations. The total sum of deviations from the mean is zero regardless of the form of the distribution; the standard deviation involves the deviations as squares and hence obliterates the distinction between positive and negative deviations. The mean cubed deviation, however, will serve as a measure of skewness. The longer deviations to the right, if the skewness is positive, will be more powerfully affected by the operation of cubing than will the shorter deviations to the left and hence the total sum of cubed deviations will be positive. It is well to extract the cube root of the mean cubed deviation and then, in order to express the skewness as a fraction of the spread of the distribution, to divide the result by the standard deviation.

Exercises.

31. From the computation form of Exercise 1 compute, in accordance with the second method, the skewness for the student height distribution.

32. Do the same for the distribution of incomes.

CHAPTER VI.

THE NORMAL PROBABILITY CURVE.

The Equation of a Frequency Curve. As discussed in Chapter II, a smoothed curve is a graphic estimate of what would be the course of the data if it could be freed from accidental variations. The smooth curve is therefore the geometric representation of a law of connection or variation. It shows, for instance, the variation of temperature with the seasons; the tendency for precipitation to depend on the month of the year; the most likely percent of students at each height. The presence of an underlying law of connection in the data implies the presence of an algebraic law connecting the x and the y coordinates. The algebraic statement of the law expressing y in terms of x is called the equation of the curve.
If the equation is given, the ordinate can be computed for any abscissa and hence the curve can be located by plotting a sufficient number of computed points. In some distributions it is possible to discover a law of connection directly from the data, and then without an extended computation to translate this law into the proper algebraic form. We shall discuss in this chapter the equation of only one type of curve, the normal curve. This form of curve is suited to the representation of a large class of distributions. And the theory of the normal curve can be made use of in the determination of the probable deviation and in the discussion of certain other properties even for a distribution to which it does not apply with sufficient accuracy to be adopted as the form of the smoothed curve.

Statistical Theory of the Normal Curve. The height of a person is the resultant sum of a large number of elements such as the length of certain bones, the widths of cartilages, the erectness of posture. And, in general, any statistical data can be analyzed into elemental components. Whenever these elemental values are relatively small in comparison with the resultant values and at the same time each element is equally likely to take any value within a small range, then the resultant data is said to be normally distributed.

With an absence of selection, as is assumed, it is reasonable to conclude that the resulting distribution will be symmetrical. And it is also apparent, after some consideration, that the frequencies at the center will be high and those at the ends of the range very small. It may be noted that in order to have a normal distribution it is not at all necessary that it be possible to actually compute the values of the elemental factors; it is only their existence under the above assumptions that is predicated.

The Equation of the Normal Curve.
It can be mathematically demonstrated that the equation of the Normal Curve is

$$y = \frac{N}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2},$$

where N is the total frequency of the distribution, $\sigma$ is the standard deviation, and x is the deviation from the mean.

[TABLE VIII. Correlation table of student heights and weights. Heights in inches (61 to 74) run across the top and weights in pounds (classes 102, 107, ..., 187) down the left hand margin. Column totals for heights 61 to 74: 2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4; total frequency 750.]

The writing of the distributions in this compact tabular form greatly facilitates the study and comparison of the different distributions.

Exercises.

1. Notice that there is a decided increase in weight with an increase in height; that there are no extremely tall persons in the group who are at the same time extremely light in weight; that there are practically no persons who are both short and extremely heavy.

2. Note that there is a closer connection between height and weight for the shorter and lighter individuals than for persons with medium values of the two characteristics.

The Construction of a Correlation Table. Let us construct the correlation table of monthly precipitation and monthly mean temperatures for the Columbus Station. The data is given under Exercises 2 and 3 of Chapter III. Let the horizontal scale refer to temperatures and let each class of this scale have a width of five degrees. The vertical scale will then refer to precipitation and let the width of classes be taken as one-half inch.
The scales are written across the top and down the left hand margin respectively in order to leave room for the summations across the bottom and down the right hand margin. Under this arrangement of the scales y increases in value from top to bottom and hence the positive direction for y is downward.

In constructing the table it is convenient to rewrite the data according to classes and at the same time to combine the two distributions. There is no need for retaining the dates but care must be taken that the measures from exactly the same months are written together. This is done by starting with January, 1879, and proceeding with the Januarys and then February, 1879, and so on in order. The temperature figures are written first in each pair of numbers, and the lower limit is written as the class number of each class. Thus the pair 25, 1.5 refers to a month with a mean temperature from 25 to 29 degrees inclusive and with a precipitation from 1.5 to 1.9 inches inclusive. In this way there will be built up a table of the following form:

Temperature:     25   40   20   30   25   20   20   20   25   25
Precipitation:  1.5  4.0  2.0  4.5  3.0  2.0  3.5  4.0  2.0  3.5, etc.

Next the rulings must be made for the table. The tabulation proceeds in the following manner: for the first pair of numbers find the 25 column and drop down this column to the precipitation class 1.5 and mark a score; then to the 40 column and down to the 4.0 class and tally; then to column 20 and precipitation class 2.0; etc. The diagram of tallies, usually dots, is called the Scatter Diagram.

[TABLE IX. Correlation Table of Temperature and Precipitation. Temperature classes (15, 20, ..., 75 degrees) run across the top and precipitation classes in inches (0.5, 1.0, ..., 9.5) down the left hand margin. Column totals for temperature classes 15 to 75: 3, 16, 31, 43, 44, 42, 21, 44, 30, 35, 39, 73, 35; total frequency 456.]

Table IX, the correlation table, is made from the Scatter Diagram by inserting the frequencies in the place of the tallies.

Exercises.

3. Do wet months uniformly occur with warm months? or is there more of a tendency for wet and cold or cool months to be associated?

4. What may be said as to the tendency for dry and warm months to be associated? for dry and cool months?

5. Does there seem to be as close a connection between precipitation and temperature as between height and weight?

6. Is it not possible that the real connection between precipitation and temperature in this table is obscured by the fact that data for all four seasons is thrown together? Explain.

Definitions and Symbols. The properties, as height and weight or temperatures and precipitation, are called the attributes or characteristics. The horizontal deviations are called the x classes or deviations, and the vertical, the y classes or deviations. Each subclass or subgroup thus has a value of x and of y associated with it. It is convenient to number the x and y classes from left to right and from top to bottom, respectively, and use these numbers for class numbers instead of the actual class values. Thus there are 17 persons with height 66 inches and weight 122 pounds; and 4 months with a mean temperature of from 40 to 45 degrees and a precipitation of from 3.0 to 3.5 inches. In terms of x and y, the subclass x = 6, y = 5 has a frequency of 17; the subclass x = 5, y = 6 has a frequency of 4 months.

The columns and rows are spoken of as arrays; the columns as y-arrays of type x and the rows as x-arrays of type y.
Or the concrete names of the data may be given to the arrays: the weight array of height 67 inches; the height array of weight 132 pounds; the precipitation array of temperature 40 degrees, etc. It should be noted that the weight array of height type 67 inches is the distribution with respect to weight of the persons having a height of 67 inches; the precipitation array of type 40 degrees is the precipitation distribution of the months having a mean temperature of 40 degrees. A y array of type x and an x array of type y are said to be arrays of opposite sense. Two y arrays or two x arrays are arrays of the same sense.

The frequency of a y array is denoted by the symbol $n_x$, where x is the type of the array. The frequency of an x array is denoted by the symbol $n_y$, where y is the type. The frequency of a subclass is denoted by the symbol $n_{xy}$, where x and y are the deviations of the subclass; that is, the types of its two arrays. Thus, $n_{61} = 2$; $n_{132} = 93$; $n_{6?,142} = 12$; or, if the simpler class numbers are used, $n_{1:} = 2$; $n_{:7} = 93$; $n_{?:9} = 12$. When the latter form of class numbers is employed it is necessary to distinguish between x and y class numbers by means of a colon. Sometimes the distinction between x and y deviations or class numbers is made by the use of subscripts, as $x_1 y_2$.

Exercises.

7. Write the values of $n_x$ for x = 2, 4, 9 in the precipitation data.

8. Write the values of n-i- A ii : for both the height-weight and the precipitation-temperature data.

9. Practice stating the frequencies of the various arrays and subgroups; e.g. the frequency of the weight array of type 8 (68) is 126.

10. Note that $n_{1:7} + n_{2:7} + n_{3:7} + \dots + n_{14:7} = n_{:7} = 93$, for the height-weight data.

11. Write other statements in the form of that of Exercise 10.

The mean of the vertical column of totals is called the mean of all the weights, and in general, the mean of all the y's; and is denoted by the symbol $\bar{y}$.
It is the mean of the vertical deviations of the variates when unclassified with respect to the horizontal attribute; the mean weight for all heights; the mean monthly precipitation disregarding temperature; the mean monthly precipitation for all temperatures taken together. Likewise, the mean of all the x's is denoted by the symbol $\bar{x}$.

The means of the weight arrays are denoted by the symbols $\bar{y}_{61}$, $\bar{y}_{62}$, $\bar{y}_{63}$. In general the mean of the y-array of type x is denoted by the symbol $\bar{y}_x$. The mean of the x-array of type y is denoted by the symbol $\bar{x}_y$.

Exercises.

12. From the following data construct the correlation table of top hog and top beef cattle prices at Chicago.

Chicago Monthly Top Hog Prices (dollars).

Year   Jan.   Feb.   Mar.   Apr.    May   June   July   Aug.  Sept.   Oct.   Nov.   Dec.
1916   8.10   8.90  10.10  10.05  10.35  10.15  10.25  11.55  11.60  10.35  10.35  10.80
1915   7.40   7.25   7.05   7.90   7.95   7.95   8.12   8.05   8.50   8.95   7.75   7.10
1914   8.00   8.90   9.00   8.95   8.67   8.50   9.30  10.20   9.75   9.00   8.25   7.75
1913   7.80   8.70   9.62   9.70   8.85   9.00   9.62   9.40   9.65   9.10   8.30   8.15
1912   6.70   6.57   7.95   8.20   8.05   7.30   8.50   9.00   9.27   9.42   8.30   7.85
1911   8.30   7.90   7.35   6.90   6.50   6.72   7.55   7.95   7.80   6.90   6.72   6.60
1910   9.05  10.00  11.20  11.00   9.35   9.80   9.60   9.70  10.10   9.65   8.70   8.10
1909   6.70   6.95   7.15   7.60   7.55   8.20   8.45   8.32   8.60   8.40   8.45   8.75
1908   4.72   4.70   6.35   6.45   5.90   6.67   7.10   7.10   7.60   7.20   6.40   6.15
1907   7.05   7.25   7.10   6.90   6.65   6.42   6.65   6.72   7.00   7.00   6.32   5.30
1906   5.72   6.42   6.55   6.82   6.67   6.85   7.00   6.75   6.82   6.85   6.50   6.55
1905   5.00   5.12   5.55   5.72   5.65   5.70   6.17   6.45   6.20   5.80   5.25   5.35
1904   5.20   5.30   5.82   5.50   4.95   5.40   5.90   5.80   6.37   6.30   5.25   4.87
1903   7.10   7.65   7.87   7.65   7.15   6.45   6.10   6.20   6.45   6.50   5.50   4.90
1902   6.85   6.60   6.95   7.50   7.50   7.95   8.25   7.95   8.20   7.92   6.95   6.80
1901   5.47   5.65   6.20   6.25   6.05   6.30   6.40   6.75   7.37   7.10   6.30   6.90
1900   4.92   5.10   5.55   5.85   5.57   5.42   5.55   5.57   5.70   5.55   5.12   5.10
1899   4.05   4.05   4.00   4.15   4.05   4.00   4.70   5.00   4.90   4.90   4.35   4.45
1898   4.00   4.27   4.17   4.15   4.80   4.50   4.17   4.20   4.15   4.00   3.85   3.75
1897   3.60   3.75   4.25   4.25   4.05   3.65   4.00   4.55   4.65   4.40   3.80   3.60
1896   4.45   4.35   4.35   4.15   3.75   3.60   3.70   3.75   3.50   3.65   3.67   3.65
1895   4.80   4.65   5.30   5.42   4.97   5.10   5.70   5.40   4.65   4.50   3.85   3.75

Chicago Monthly Top Beef Cattle Prices (dollars).

Year   Jan.   Feb.   Mar.   Apr.    May   June   July   Aug.  Sept.   Oct.   Nov.   Dec.
1916   9.85   9.75  10.05  10.00  10.90  11.50  11.30  11.50  11.50  11.65  12.40  13.00
1915   9.70   9.50   9.15   8.90   9.65   9.95  10.40  10.50  10.50  10.60  10.55  11.60
1914   9.50   9.75   9.75   9.55   9.60   9.45  10.00  10.90  11.05  11.00  11.00  11.40
1913   9.50   9.25   9.30   9.25   9.10   9.20   9.20   9.25   9.50   9.75   9.85  10.25
1912   8.75   9.00   8.85   9.00   9.40   9.60   9.85  10.65  11.00  11.05  11.00  11.25
1911   7.10   7.05   7.35   7.10   6.50   6.75   7.35   8.20   8.25   9.00   9.25   9.35
1910   8.40   8.10   8.85   8.65   8.75   8.85   8.60   8.50   8.50   8.00   7.75   7.55
1909   7.50   7.15   7.40   7.15   7.30   7.50   7.65   8.00   8.50   9.10   9.25   9.50
1908   6.40   6.25   7.50   7.40   7.40   8.40   8.25   7.90   7.85   7.65   8.00   8.00
1907   7.30   7.25   6.90   6.75   6.50   7.10   7.50   7.60   7.35   7.45   7.25   6.35
1906   6.50   6.40   6.35   6.35   6.20   6.10   6.50   6.85   6.95   7.30   7.40   7.90
1905   6.35   6.45   6.35   7.00   6.85   6.35   6.25   6.50   6.50   6.40   6.75   7.00
1904   5.90   6.00   5.80   5.80   5.90   6.70   6.65   6.40   6.55   7.00   7.30   7.65
1903   6.85   6.15   5.75   5.80   5.65   5.15   5.65   6.10   6.15   6.00   5.85   6.00
1902   7.75   7.35   7.40   7.50   7.70   8.50   8.85   9.00   8.85   8.75   7.40   7.75
1901   6.15   6.00   6.25   6.00   6.10   6.55   6.40   6.40   6.60   6.90   7.25   8.00
1900   6.60   6.10   6.05   6.00   5.85   5.90   5.85   6.20   6.15   6.00   6.00   7.50
1899   6.30   6.25   5.90   5.85   5.75   5.75   6.00   6.65   6.90   7.00   7.15   8.25
1898   5.50   5.85   5.80   5.50   5.50   5.35   5.65   5.75   5.85   5.90   6.25   6.25
1897   5.50   5.40   5.65   5.50   5.45   5.30   5.25   5.50   6.00   5.40   6.00   5.65
1896   4.00   4.75   4.75   4.75   4.55   4.65   4.60   5.00   5.30   5.30   5.45   6.50
1895   5.80   5.80   6.60   6.40   5.25   6.00   6.00   6.00   6.00   5.60   5.00   5.50

13. From the data of Exercise 12, construct the correlation table of hog prices and months of the year.

14. From data obtained from a financial journal construct a correlation table of the prices of common and preferred stocks.

15. In the correlation table of Exercise 12 does there appear to be a sharp tendency for the beef cattle arrays to vary with the changing live hog prices? Is the tendency more pronounced at some parts of the table than at others?

16. Compare the tendencies for close connection between the attributes in the table of Exercise 13 with that in Exercise 12.

Correlation. In the table of student heights and weights there is a decided tendency for heaviness and tallness to be associated and for lightness and shortness to be associated. There is likewise a pronounced tendency for the prices of live hogs and beef cattle to vary together. It is to be noted that the two series of measurements do not vary together in every case; that is, there are months in which the price of hogs is low but the price of beef high. But when all the months of an array are taken together the general tendency for the progressive increase of beef cattle prices with each increase of hog prices is evident.

Two characteristics are said to be correlated when there is a tendency for the changes in the value of one to depend on the changes in the value of the other. The two characteristics may increase together or one may increase while the other decreases, and even in a part of the table the movement of the changes may be together and in another part the two series of changes may move in opposition; the essential evidence for the presence of correlation is that the measurements change from array to array.
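The tallying of a correlation table and the inspection of its arrays can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the seven height and weight pairs are invented, the function names `correlation_table` and `array_means` are hypothetical, and the class mid-value is taken as the lower class limit plus half the class width.

```python
from collections import defaultdict

def correlation_table(pairs, x_width, y_width):
    """Tally (x, y) pairs into subclasses labelled by their lower class limits."""
    table = defaultdict(int)
    for x, y in pairs:
        x_class = (x // x_width) * x_width
        y_class = (y // y_width) * y_width
        table[(x_class, y_class)] += 1
    return dict(table)

def array_means(table, y_width):
    """Mean of each y-array of type x, using class mid-values for y."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (x_class, y_class), n in table.items():
        sums[x_class] += n * (y_class + y_width / 2)
        counts[x_class] += n
    return {x: sums[x] / counts[x] for x in sorted(counts)}

# Hypothetical (height in inches, weight in pounds) pairs, invented for illustration.
pairs = [(62, 105), (62, 110), (66, 120), (66, 125), (66, 130), (70, 140), (70, 150)]
table = correlation_table(pairs, x_width=4, y_width=5)
means = array_means(table, y_width=5)
for height_class, mean_weight in means.items():
    print(height_class, mean_weight)
```

In this invented sample the mean weight rises from one height array to the next, which is the kind of change from array to array that the definition takes as evidence of correlation.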
In uncorrelated data there is no tendency for the distributions of the arrays to change from type to type. In perfectly correlated data there is an exact connection between the values of the two characteristics. If height and weight were perfectly correlated, for instance, all persons of a given height, say 68 inches, would be of the same weight and hence all the frequencies of the weight array of type 68 would lie within a single subgroup. Between the two extremes of perfect and of no correlation there are all degrees of correlation.

Exercises.

17. Study the degrees of correlation shown by the tables constructed in working the exercises of this chapter.

18. Is it possible to find actual data which shows absolutely no correlation? Construct an imaginary table which shows no correlation.

CHAPTER VIII.

THE CORRELATION RATIO.

The Mean as Representative of the Array. In Chapter IV it was stated that the modal deviation is the most frequent deviation; that is, the most typical deviation of a distribution. Because the mode cannot be computed by a simple and uniform process of arithmetic the mean is a more practicable representative of the array. And this substitution of the mean for the mode will rarely produce a serious error.

Since the mean of the frequencies of an array is taken as the representative of the deviations of the array, from the definition of correlation given in Chapter VII it is apparent that the amount or degree of correlation in the data will be indicated by the variation in the means from array to array.

Regression Curves. The variation in the means of the arrays is shown graphically by the curve of means, which is called a regression* curve. Since there are two sets of arrays there are two regression curves.

Coordinate Axes.
It is usual to take for the horizontal or x-axis the horizontal line thru the mean of all the y's; that is, the horizontal line at a distance $\bar{y}$ below the base line of the table, and for the y-axis the vertical line distant $\bar{x}$ from the left marginal vertical. The point of intersection of these two lines is called the center of the table. Deviations to the right are taken positive and those to the left negative; deviations downward from the new horizontal axis are positive and deviations upward are considered negative. Sometimes this convention of plus downward and negative upward is departed from. No confusion can result however if it is remembered that the direction in which an attribute is increasing is always taken as positive.

* So called by Francis Galton for certain reasons which arose in his investigations in biology. The name has become general.

Exercises.

1. Draw the axes and regression curves for each of the correlation tables of Chapter VII.

2. Study and compare the forms of the regression curves of Exercise 1.

Correlation and the Regression Curves. In uncorrelated data the mean of an array does not depend on the type of the array; that is, does not change from array to array, and hence the unchanging value of the means must be the same as the mean of all the y's. The regression curve for uncorrelated data therefore approximates a straight line coinciding with the horizontal axis. For correlated data the regression curve diverges or deviates from this position of coincidence with the axis. It must be noted that the shape of the regression curve may be quite irregular without effect on the degree of correlation present in the data; it is the distance of the means from the axis that counts in determining the degree of correlation present.
Hence any numerical measure of the extent of correlation in the data must depend on the deviation of the means from the horizontal axis thru the center. Since there are two regression curves and two axes there are two correlations in each correlation table and their numerical measures involve the deviations of the respective regression curves from the corresponding straight lines thru the center. Thus the dependence of height on weight and of weight on height are two distinct correlations.

Mean Squared Deviation of the Means of Arrays. The mean squared deviation is the most convenient measure of the deviation of the means of the arrays. In computing this the means of the arrays are first written in a vertical column and then the difference between each mean and the mean of all the variates is set down in a second column. Because the differences are used only in the squared form it is not necessary to retain a negative sign. The third column in the computations of Table X contains the squares of the differences. Since the mean of an array is used as the representative of the individuals of that array, each of these individuals is possessed of the squared deviation. Hence each square must be multiplied by the frequency of the corresponding array. The resultant products form the fourth column. The sum of this fourth column is the total sum of squared deviations and this sum divided by the total frequency is the mean squared deviation.

The Correlation Ratio. The mean squared deviation just obtained would be a significant measure of correlation were it not for the fact that it does not take into account the dispersion of the data as a whole. Without changing the mean and the frequency of a single y-array, it would be possible to spread out each array to twice its length.
This alteration would concern the dispersion of the data as a whole but would leave the mean square deviation from the horizontal axis unchanged. It is evident that the value of the mean square deviation of the means of the arrays is of less significance in the more spread out data. Hence the dispersion of the data as a whole must be considered in interpreting the value of the mean squared deviation. The dispersion of the data as a whole is given by the standard deviation of the frequencies of the totals in the vertical sum column. The smaller this standard deviation the more significant is the deviation of the means, and the larger this standard deviation the less significant the deviation of the means. It is therefore reasonable to divide the square root of the mean square deviation of the means of the arrays by the standard deviation from the marginal column. The quotient is called the correlation ratio, and is denoted by the Greek letter $\eta$.

The computation of the correlation ratio for the dependence of student weight on height follows. A carefully planned outline scheme of computation must be made before the figures are entered. The means and the one standard deviation were computed in the usual manner. We have, for the data as a whole, $\bar{y} = 7.9$, $\sigma_y^2 = 9.79$. The means of the arrays are written in the second column just after the frequencies. The differences between the means and $\bar{y}$ follow in the third column. The squares, and the product of the squares by the frequencies, are the fourth and fifth columns respectively. The symbols explained in Chapter VII are written at the head of each column.

Computation of $\eta$.
$n_x$    $\bar{y}_x$   $\bar{y} - \bar{y}_x$   $(\bar{y} - \bar{y}_x)^2$   $n_x(\bar{y} - \bar{y}_x)^2$
   2     9.5     1.6     2.56       5.12
  10     4.7     3.2    10.24     102.40
  11     4.4     3.5    12.25     134.75
  38     4.6     3.1     9.61     365.18
  57     6.1     1.8     3.24     184.68
  93     6.8     1.1     1.21     112.53
 106     6.9     1.0     1.00     106.00
 126     8.0     0.1     0.01       1.26
 109     8.8     0.9     0.81      89.19
  87     9.7     1.8     3.24     281.88
  75    10.9     3.0     9.00     675.00
  23    10.3     2.4     5.76     132.48
   9    11.1     3.2    10.24      92.16
   4    10.5     2.6     6.76      27.04
 750                              2309.67

$$\eta^2 = \frac{2309.67}{750 \times 9.79} = 0.3146, \qquad \eta = 0.56.$$

TABLE X.

Exercises.

3. Compute the value of $\eta$ for the dependence of monthly precipitation upon monthly mean temperature as shown by the data of the Columbus Weather Station.

4. Compute the value of $\eta$ for the correlation of Chicago top hog prices with Chicago top beef cattle prices as shown in the table of Exercise 12 of the preceding chapter.

5. Compute values of $\eta$ from the tables of Exercises 13 and 14 of Chapter VII.

Two Values for $\eta$ in Each Table. From the method of computation it is clear that there are two values for $\eta$ in each correlation table, one for each regression curve. The correlation ratio of weight with height, for instance, may differ considerably from the correlation ratio of height with weight; the dependence of precipitation on temperature may be of a decidedly different degree from that of temperature on precipitation. The two values of $\eta$ do not ordinarily differ markedly but there can be no a priori assurance that they will be essentially of equal value and hence it is necessary to compute the two values separately in case both are desired. To distinguish the two measures, for the dependence of y on x, of weight on height, the symbol $\eta_y$ is used, and the symbol $\eta_x$ refers to the dependence of x on y.

Exercises.

6. Compute the value of $\eta$ for the correlation of height with weight and compare with the other value of $\eta$ computed in Table X.

7.
Compute the value of $\eta_x$ from the precipitation-temperature correlation table, and compare the values of $\eta_x$ and $\eta_y$.

[...] where $\sigma_y$ is the right hand marginal standard deviation and [...]

  x    $\bar{y}_x$   $\bar{y}_x - \bar{y}$   $n_x$   $x - \bar{x}$   $n_x(x - \bar{x})$   $n_x(x - \bar{x})(\bar{y}_x - \bar{y})$
  1     9.5     1.6      2    -6.9    -13.8      22.08
  2     4.7    -3.2     10    -5.9    -59.0     188.80
  3     4.4    -3.5     11    -4.9    -53.9     188.65
  4     4.6    -3.1     38    -3.9   -148.2     459.42
  5     6.1    -1.8     57    -2.9   -165.3     297.54
  6     6.8    -1.1     93    -1.9   -176.7     194.37
  7     6.9    -1.0    106    -0.9    -95.4      95.40
  8     8.0    +0.1    126    +0.1    +12.6       1.26
  9     8.8    +0.9    109    +1.1   +119.9     107.91
 10     9.7    +1.8     87    +2.1   +182.7     328.86
 11    10.9    +3.0     75    +3.1   +232.5     697.50
 12    10.3    +2.4     23    +4.1    +94.3     226.32
 13    11.1    +3.2      9    +5.1    +45.9     146.88
 14    10.5    +2.6      4    +6.1    +24.4      63.44

$$\sum n_x (x - \bar{x})(\bar{y}_x - \bar{y}) = 3018.4, \qquad r = \frac{3018.4}{750\,\sigma_x \sigma_y} = 0.55.$$

TABLE XI.

Exercises.

3. Compute the value of r for the monthly precipitation and temperature data.

4. Compute r for the top-hog and top-beef-cattle data.

5. Compare the values of r in Exercises 3 and 4 and in the weight-height data with the corresponding values for $\eta$.

6. Compute the value of r from the monthly price-of-hogs data of Exercise 12, Chapter VII. Compare with the corresponding value for $\eta$.

7. Does there seem to be a tendency for $\eta$ and r to agree more closely for highly correlated data than for material of small correlation?

8. Compare the amount of labor involved in the computation of r with that involved in the computation of $\eta$.

The Relation of r to $\eta$. In data exhibiting a regression which is truly linear the value of $\eta$ is, of course, identical with that of r. In the case of any but truly linear regression it is readily shown in Chapter XII that the value of r is necessarily less than that of $\eta$. In fact if the regression curve is of a certain shape the value of r will be very small even tho practically perfect correlation exists.

Unlike the correlation ratio the coefficient of correlation expresses a property of the correlation table as a whole and not merely of one or the other of the two correlations of the table.
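The relation of r to $\eta$ described above can be checked numerically. The following Python sketch is an illustration only: the two small data sets are invented, the function names are hypothetical, and $\eta$ is computed for the dependence of y on x as the square root of the mean squared deviation of the array means divided by the standard deviation of all the y's, following the definition of Chapter VIII.

```python
from collections import defaultdict
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def coefficient_r(pairs):
    """Product-moment coefficient of correlation of the (x, y) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation_ratio(pairs):
    """Correlation ratio for the dependence of y on x (one y-array per x value)."""
    arrays = defaultdict(list)
    for x, y in pairs:
        arrays[x].append(y)
    ys = [y for _, y in pairs]
    my = mean(ys)
    between = sum(len(a) * (mean(a) - my) ** 2 for a in arrays.values())
    total = sum((y - my) ** 2 for y in ys)
    return sqrt(between / total)

# Invented data: a parabolic regression and a truly linear one.
parabola = [(x, x * x) for x in (-2, -1, 0, 1, 2)]
line = [(x, 2 * x + 1) for x in range(5)]

print(round(coefficient_r(parabola), 6), round(correlation_ratio(parabola), 6))
print(round(coefficient_r(line), 6), round(correlation_ratio(line), 6))
```

For the parabola the array means follow the curve exactly, so the correlation ratio is unity while r vanishes; for the straight line the two measures agree.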
Again, unlike the correlation ratio, the negative sign obtained in the final extraction of the square root (in the discussion of page 76) has a significance; it indicates that the regression line has a negative slope and hence that the connection between the attributes is inverse; that is, one attribute increases while the other decreases.

Because both positive and negative values of r can occur there is no tendency, as there is in the case of $\eta$, for small values of r to be larger than the actual degree of correlation would warrant.

Limiting Values for r. In data of zero correlation it is clear that the regression line coincides with the axis and hence the value of r must be zero. Reasoning from the relation of r to $\eta$ we see that for truly linear regression perfect correlation leads to a value of r equal to unity. The unity value for r will be positive or negative according as the correlation is direct or inverse. According to the underlying theory of the coefficient of correlation, for data in which a regression is not linear the value of r cannot be unity even tho there is perfect correlation, and hence r is necessarily smaller in value than the degree of correlation would require.

Statistical Properties of the Coefficient of Correlation. The coefficient r is, as the preceding discussions show, a conservative measure of correlation. In periodic data exhibiting a sinusoid form for the regression curve the correlation may be high but because the departure of the regression from linearity is so wide the value of r understates the correlation and hence its applicability in such data is not of importance.

The characteristic importance of the coefficient r is in defining the slope of the regression lines. It furnishes the most convenient method for defining the general tendencies in the data. The rise of prices, for instance, during the last fifteen years can be readily measured by the rate of rise of the regression line.
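How r fixes the slope of a regression line can be sketched as follows. The sketch assumes the usual relation, slope of y on x equal to $r\sigma_y/\sigma_x$, which this section does not state explicitly; the yearly prices are invented and the function name `r_and_slope` is hypothetical.

```python
from math import sqrt

def r_and_slope(pairs):
    """Return r and the slope r * sigma_y / sigma_x of the regression of y on x."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    r = sxy / sqrt(sxx * syy)
    sigma_x = sqrt(sxx / n)
    sigma_y = sqrt(syy / n)
    return r, r * sigma_y / sigma_x

# Invented yearly prices (year number, price in dollars), for illustration only.
prices = [(0, 5.0), (1, 5.5), (2, 6.5), (3, 6.0), (4, 7.0)]
r, slope = r_and_slope(prices)
print(round(r, 3), round(slope, 3))
```

With these invented figures r is 0.9 and the regression line rises 0.45 dollars per year, which is the rate of rise that would be read from the regression line.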
Therefore for the single purpose of measuring correlation the coefficient of correlation is distinctly inferior to the correlation ratio both in convenience and reliability. It should never be used as a measure of correlation without first carefully testing the form of the regression. It does have, however, the highly useful property of giving the slopes of the regression lines.

Test for Linearity of Regression. It would be suspected from the preceding theory and discussion that the difference between $\eta$ and r should be an indicator of the departure of the regression from linearity. A somewhat more convenient measure of this departure than the simple difference is the difference of the squares of $\eta$ and r.

Probable Deviations. The following probable deviations can be derived.

P. E. of $r = 0.6745\,\frac{1 - r^2}{\sqrt{N}}$

P. E. of $(\eta^2 - r^2) = \frac{1.35}{\sqrt{N}}\sqrt{\eta^2 - r^2}$

A practical criterion of linearity is to assume linearity when $\frac{\sqrt{N}\,\sqrt{\eta^2 - r^2}}{1.35} < 2.5$.

9. Compute the regression equations for each of the correlation tables of Chapter VII.

10. How can the value of r be obtained graphically from the regression lines? Is this a practicable method of finding the value of r?

11. Compute the measure of departure from linearity, $(\eta^2 - r^2)$, for the correlation tables of Chapter VII.

12. A correlation table has two measures of departure from linearity. Show that one regression may be linear and the other non-linear.

13. Show that if the value of r is high the regressions must both be approximately linear.

14. By extending the regression line estimate the price of live hogs for January, 1917.

15. What weight should correspond to a student height of 50 inches?

16. What is the best estimate of the temperature for a month with a precipitation of 3.4 inches?

17. Discuss the value of the probable deviations in exercises 14-16.

CHAPTER X.

CORRELATION FROM RANKS.

Rank in a Series.
When the data consists not of the direct measurements of the characteristics but of their order or rank in a series the correlation of the ranks may differ materially from the true variate correlation. Let us define rank as position in a series so that an individual of rank one would have no individuals above or before it; an individual of rank two would have one individual before it, etc.

To pass from rank to variate correlation it is necessary to know the form of the distribution of the values of the characteristics. Only for normal distributions has the requisite theory been developed. It is consequently necessary to employ the same formulas for other forms of distributions, although this may sometimes open the way to serious inaccuracies.

Let the ranks of the same individual in regard to the respective characteristics be $v_x$ and $v_y$. Let there be N individuals and let $\bar v_x$ and $\bar v_y$ denote the respective means of the two series and $\sigma_{v_x}$ and $\sigma_{v_y}$ the standard deviations. Also let all the measurements of each characteristic be distinct in value; that is, let there be no equal measurements.

Theorem I. The mean ranks $\bar v_x$ and $\bar v_y$ are each equal to $(N + 1)/2$.

Since there are as many ranks as individual measurements and since the ranks proceed uniformly from 1 to N the mean is $(N + 1)/2$.

Theorem II. The standard deviations of the ranks are each equal to $\sqrt{\frac{N^2 - 1}{12}}$.

For, $N\sigma_{v_x}^2 = \sum v_x^2 - N\bar v_x^2 = \frac{1}{6}N(N + 1)(2N + 1) - \frac{1}{4}N(N + 1)^2 = \frac{1}{12}(N^3 - N)$, on applying the rules $\sum v^2 = \frac{1}{6}N(N + 1)(2N + 1)$ and $N\bar v^2 = \frac{1}{4}N(N + 1)^2$.

Therefore $\sigma_{v_x}^2 = \frac{1}{12}(N^2 - 1)$.

The following theorem is necessary for the computation of rank correlation.

Theorem III. If $\sigma_x = \sigma_y$, $r = 1 - \frac{\sigma_{(x-y)}^2}{2\sigma_x^2}$.

For, $(x - y)^2 = x^2 + y^2 - 2xy$, or $N\sigma_{(x-y)}^2 = N\sigma_x^2 + N$ . . .

. . . Ans. $\mu_3 = npq\,(p - q)$; $\mu_4 = npq\,(3(n - 2)pq + 1)$. (See Hardy, "Construction of Tables of Mortality," p. 107 et seq.).
The computation of the moments about the mean either directly or by first computing about a convenient origin and then transforming to the mean is open to the serious practical objection that there are no convenient methods of checking the results. The arithmetic of the following summation methods is comparatively brief and admits of satisfactory checks on the correctness of the results.

The First Summation Method of Computing the Moments. The theory of this and the second summation method which follows immediately after it are somewhat detailed but both are entirely elementary throughout.

Let us take a distribution with the five frequencies $y_1, y_2, y_3, y_4, y_5$ corresponding to values of x equal to 1, 2, 3, 4, 5. By the ordinary direct method, the first moment about the point x = 0 is $y_1 + 2y_2 + 3y_3 + 4y_4 + 5y_5$. Now let us arrange the y's in vertical order and add in the manner indicated in the second column below.

(1)    (2)                          (3)
y1     y1 + y2 + y3 + y4 + y5       y1 + 2y2 + 3y3 + 4y4 + 5y5
y2     y2 + y3 + y4 + y5            y2 + 2y3 + 3y4 + 4y5
y3     y3 + y4 + y5                 y3 + 2y4 + 3y5
y4     y4 + y5                      y4 + 2y5
y5     y5                           y5
Sums   y1 + 2y2 + 3y3 + 4y4 + 5y5   y1 + 3y2 + 6y3 + 10y4 + 15y5

       (4)                              (5)
       y1 + 3y2 + 6y3 + 10y4 + 15y5     y1 + 4y2 + 10y3 + 20y4 + 35y5
       y2 + 3y3 + 6y4 + 10y5            y2 + 4y3 + 10y4 + 20y5
       y3 + 3y4 + 6y5                   y3 + 4y4 + 10y5
       y4 + 3y5                         y4 + 4y5
       y5                               y5
Sums   y1 + 4y2 + 10y3 + 20y4 + 35y5    y1 + 5y2 + 15y3 + 35y4 + 70y5

The sum of the second column is thus the same as the first moment. By the direct method the second moment about the same point is $y_1 + 4y_2 + 9y_3 + 16y_4 + 25y_5$ divided by N. Let us designate the sum of column (2), when divided by N, by $S_2$; the sum of column (3), when divided by N, by $S_3$; the sum of column (4), when divided by N, by $S_4$, etc. That is,

$S_2 = \frac{y_1 + 2y_2 + 3y_3 + \ldots}{N}$, $S_3 = \frac{y_1 + 3y_2 + 6y_3 + \ldots}{N}$, $S_4 = \frac{y_1 + 4y_2 + 10y_3 + \ldots}{N}$, $S_5 = \frac{y_1 + 5y_2 + 15y_3 + \ldots}{N}$.

It is apparent on inspection that $2S_3 - S_2$ is the second moment.
In symbols,

$2S_3 - S_2 = \frac{2}{N}(y_1 + 3y_2 + 6y_3 + 10y_4 + 15y_5) - \frac{1}{N}(y_1 + 2y_2 + 3y_3 + 4y_4 + 5y_5) = \frac{1}{N}(y_1 + 4y_2 + 9y_3 + 16y_4 + 25y_5)$.

That is, $\mu'_2 = 2S_3 - S_2$.

The third moment about the same point of reference is $\frac{1}{N}(y_1 + 8y_2 + 27y_3 + 64y_4 + 125y_5)$. For this moment the following relation is readily verified:

$\mu'_3 = 6S_4 - 6S_3 + S_2$.

Extending the reasoning to the case of the fourth moment, we have

$\mu'_4 = 24S_5 - 36S_4 + 14S_3 - S_2$.

We thus have four relations connecting the moments with the S's:

$\mu'_1 = S_2$,
$\mu'_2 = 2S_3 - S_2$,
$\mu'_3 = 6S_4 - 6S_3 + S_2$,
$\mu'_4 = 24S_5 - 36S_4 + 14S_3 - S_2$.

Transferred to the mean as origin by the formulas of page 88 these moments become

$S_2 = d$;
$\mu_2 = \mu'_2 - d^2 = 2S_3 - S_2 - d^2 = 2S_3 - d(1 + d)$,
$\mu_3 = 6S_4 - 6S_3 + S_2 - 3d\mu_2 - d^3 = 6S_4 - 3\mu_2(1 + d) - d(1 + d)(2 + d)$,

and similarly,

$\mu_4 = 24S_5 - 2\mu_3\{2(1 + d) + 1\} - \mu_2\{6(1 + d)(2 + d) - 1\} - d(1 + d)(2 + d)(3 + d)$.

It is evident that the same relations hold for a larger number of classes than the five which we have assumed for the purpose of illustrating the method.

These relations connecting the moments about the mean with the sums obtained by this process of summation are materially shorter and more convenient than the direct formulas. It will be noticed that the sum of any column is the largest number in the next column, so that a satisfactory check on the summation is afforded. It is possible, however, by taking the point of reference near the mean, to still further shorten the labor of the computation.

The Second Summation Method of Computing the Moments. To illustrate the second method let us take a distribution of eight classes and assume the fixed point of reference at class 5. Then we sum from both top and bottom to, but not including, the frequencies of class 5 in accordance with the following scheme.
(1)    (2)                    (3)                      (4)
y1     y1                     y1                       y1
y2     y2 + y1                y2 + 2y1                 y2 + 3y1
y3     y3 + y2 + y1           y3 + 2y2 + 3y1           y3 + 3y2 + 6y1
y4     y4 + y3 + y2 + y1      y4 + 2y3 + 3y2 + 4y1
y5
y6     y6 + y7 + y8           y6 + 2y7 + 3y8           y6 + 3y7 + 6y8
y7     y7 + y8                y7 + 2y8                 y7 + 3y8
y8     y8                     y8                       y8

       (5)                    (6)
       y1                     y1
       y2 + 4y1

       y6 + 4y7 + 10y8        y6 + 5y7 + 15y8
       y7 + 4y8               y7 + 5y8
       y8                     y8

Forming $\mu'_1$, about the point x = 5, by the direct method we have

$\mu'_1 = -\frac{1}{N}(y_4 + 2y_3 + 3y_2 + 4y_1) + \frac{1}{N}(y_6 + 2y_7 + 3y_8)$.

But $\mu'_1$ has been defined as equal to $S_2$. Hence $S_2$ is obtained by subtracting the last upper summation term from the last lower term in column (3).

By direct computation,

$\mu'_2 = \frac{1}{N}(y_4 + 4y_3 + 9y_2 + 16y_1 + y_6 + 4y_7 + 9y_8)$.

But $\mu'_2 = 2S_3 - S_2$, or $2S_3 = \mu'_2 + S_2$. Hence

$2S_3 = \frac{1}{N}(y_4 + 4y_3 + 9y_2 + 16y_1 + y_6 + 4y_7 + 9y_8) - \frac{1}{N}(y_4 + 2y_3 + 3y_2 + 4y_1) + \frac{1}{N}(y_6 + 2y_7 + 3y_8)$.

Therefore, $S_3 = \frac{1}{N}(y_3 + 3y_2 + 6y_1) + \frac{1}{N}(y_6 + 3y_7 + 6y_8)$.

That is, $S_3$ is the sum of the last term in the positive, or lower, summation and the last but one (the last term as written in the scheme) of the negative summation terms in column (4).

Likewise, $S_4 = \frac{1}{N}(y_6 + 4y_7 + 10y_8) - \frac{1}{N}(y_2 + 4y_1)$, the difference between the last positive summation term and the last but two of the negative summation terms in column (5).

And $S_5 = \frac{1}{N}(y_6 + 5y_7 + 15y_8) + \frac{1}{N}y_1$, the sum of the last positive summation term and the last but three of the negative terms in column (6).

After the S's are obtained the formulas of page 91 are applied to obtain the $\mu$'s. As in the first summation method the partial summations can be added for checks on the arithmetic.

This second summation method will be found very convenient, especially when the number of classes is large or the frequencies are of considerable size.

The following computations for the data of page 24 illustrate the two summation methods.
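As a check on the algebra, the first summation method can be sketched in a few lines of Python. The sketch assumes the fourteen class frequencies of the distribution of page 24 (as they appear in the computation table), takes the classes as numbered 1 to 14, and uses illustrative function names that are not part of the text. It forms the successive summations from the bottom upward and compares the resulting moments about x = 0 with those obtained by the direct method.

```python
from fractions import Fraction

# Class frequencies for classes 1..14 (total N = 750).
FREQ = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]


def summation_column_totals(freq, depth=4):
    """Totals of columns (2), (3), ... of the first summation scheme.

    Each new column is the running sum of the previous column,
    accumulated from the last class upward.
    """
    totals = []
    column = list(freq)
    for _ in range(depth):
        running = 0
        accumulated = []
        for value in reversed(column):
            running += value
            accumulated.append(running)
        column = list(reversed(accumulated))
        totals.append(sum(column))
    return totals


def moments_by_summation(freq):
    """First four moments about x = 0 from S2..S5."""
    n = sum(freq)
    s2, s3, s4, s5 = (Fraction(t, n) for t in summation_column_totals(freq))
    return [
        s2,
        2 * s3 - s2,
        6 * s4 - 6 * s3 + s2,
        24 * s5 - 36 * s4 + 14 * s3 - s2,
    ]


def moments_direct(freq):
    """First four moments about x = 0 by the ordinary direct method."""
    n = sum(freq)
    return [
        Fraction(sum(x ** k * y for x, y in enumerate(freq, start=1)), n)
        for k in (1, 2, 3, 4)
    ]


column_totals = summation_column_totals(FREQ)
summed = moments_by_summation(FREQ)
direct = moments_direct(FREQ)

print("column totals:", column_totals)
print("moments about x = 0:", [float(m) for m in summed])
print("agrees with direct method:", summed == direct)
```

Exact fractions are used so that the comparison with the direct method is not disturbed by rounding; the total of each column reappears as the first entry of the next, which is the check on the arithmetic mentioned in the text.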
Computations of this length should never be attempted without first arranging a complete form with a place for each number and that place so chosen that the number is in its most convenient location. The entire computation should be planned before the arithmetic is begun.

Class    Freq.     (2)      (3)       (4)       (5)
  1         2      750     5925     28463    105421
  2        10      748     5175     22538     76958
  3        11      738     4427     17363     54420
  4        38      727     3689     12936     37057
  5        57      689     2962      9247     24121
  6        93      632     2273      6285     14874
  7       106      539     1641      4012      8589
  8       126      433     1102      2371      4577
  9       109      307      669      1269      2206
 10        87      198      362       600       937
 11        75      111      164       238       337
 12        23       36       53        74        99
 13         9       13       17        21        25
 14         4        4        4         4         4
Totals    750     5925    28463    105421    329625

$S_2 = 7.9$, $S_3 = 37.95$, $S_4 = 140.56$, $S_5 = 439.5$

$d(1 + d) = 70.31$
$d(1 + d)(2 + d) = 696.069$
$3(1 + d) = 26.7$
$4(1 + d)$ . . .

For the second summation method, with the origin at class 7, the last lower summation terms in columns (3) to (6) are 1102, 2371, 4577 and 8185, so that

$S_2 = (1102 - 427)/750 = 0.9$
$S_3 = (2371 + 367)/750 = 3.65$
$S_4 = (4577 - 222)/750 = 5.81$
$S_5 = (8185 + 91)/750 = 11.03$

The computation from this point on is the same as under the first method, except that the origin is at class 7, or the height class 67, instead of height class 60.

Exercises.

9. Compute the moments for the frequency distribution of page 29 by the two summation methods.

10. Demonstrate the proof of the two summation methods for n classes.

11. What difference would result in the computations of the second summation method if the origin were taken at the eighth class instead of the seventh so that the upper sum in the first summation is the larger?

Correction Formulas for the Moments. All the methods that have been proposed for finding the moments assume that the frequencies are concentrated at the center of each class while actually the deviations are continuously distributed from one end of the range to the other so that there is nothing in the nature of the data to correspond to the classes, mid-ordinates, etc. A certain degree of error is therefore introduced by these methods.
We are not really working with the actual deviations but with the artificial classes built up from the actual deviations. How far, then, are facts which hold for the classes of significance for the actual variates? It may well be that in ordinary statistical work the closeness of the measurements does not warrant taking these errors into account, but the corrections are easily applied and frequently make a significant difference in the results. However, the corrections should not be applied to data not accurate enough to warrant such care, no matter if the corrections are easily applied. The methods adopted in computation must never be such as to presuppose more accurate data than that in hand.

When the distinction is made between the moments as calculated from the class frequencies and deviations and the moments calculated under the assumption of continuous variation, it is customary to denote the values as computed by $\nu_1, \nu_2, \nu_3, \nu_4$ and $\nu'_1, \nu'_2, \nu'_3, \nu'_4$, and reserve the corresponding $\mu$'s for the values under the assumption of continuity. When no account is taken of the distinction between the discrete and continuous series of frequencies, the $\mu$'s alone are used. The $\nu$'s are often spoken of as the raw or unadjusted moments and the $\mu$'s as the adjusted moments.

The adjustment or correction formulas (the class interval being taken as the unit) are:

$\mu_2 = \nu_2 - \frac{1}{12}$, $\mu_3 = \nu_3$, $\mu_4 = \nu_4 - \frac{1}{2}\nu_2 + \frac{7}{240}$.

The theory of these corrections is due to Dr. Sheppard and to Professor Pearson. A simple demonstration of the formulas is that of Bio. III, p. 308.

According to the underlying mathematical theory these correction formulas hold in strictness only for a frequency curve with high contact at each end. When these conditions are not satisfied it is probably best not to apply the corrections.

Theorem I. Changing the unit of measurement of the deviation; that is, multiplying each deviation by a constant, multiplies a moment by that constant raised to a power equal to the order of the moment.
For, $\mu_n = \frac{\sum x^n y}{N}$ and $\sum (rx)^n y = r^n \sum x^n y$.

Theorem II. Multiplying or dividing each frequency by a constant does not change the moments.

For, $\frac{\sum x^n (cy)}{\sum cy} = \frac{\sum x^n y}{\sum y}$.

Because the values of the third and fourth moments depend on the unit of measure of the deviations it is usual to employ these two moments in the forms $\beta_1$ and $\beta_2$, respectively, where $\beta_1 = \mu_3^2/\mu_2^3$ and $\beta_2 = \mu_4/\mu_2^2$. To show that $\beta_1$ and $\beta_2$ are independent of the unit of measure of x let us write

$\beta_1 = \frac{(\sum x^3 y / N)^2}{(\sum x^2 y / N)^3}$ and $\beta_2 = \frac{\sum x^4 y / N}{(\sum x^2 y / N)^2}$.

Then let x be changed into rx where r is any constant. This gives

$\beta_1 = \frac{r^6 (\sum x^3 y / N)^2}{r^6 (\sum x^2 y / N)^3}$, and similarly for $\beta_2$.

Exercises.

12. Show that adding a constant to each deviation changes the moments.

13. Show that adding a constant to each frequency changes the moments.

Summations. The following exercises are intended for practice in using summations and should be carefully worked through in order that a comprehension of the somewhat detailed discussions of subsequent chapters is not hindered by a lack of familiarity with the necessary algebra.

Exercises.

14. Show that the square of $\sum xy$ is $\sum x^2y^2 + 2\sum\sum x_s y_s x_t y_t$, where the subscripts are attached in the second summation to indicate the product of unequal deviations, and all deviations are measured from the mean.

15. By actually computing the separate value of each summation verify the relation $(\sum xy)^2 = \sum x^2y^2 + 2\sum\sum x_s y_s x_t y_t$ for the distribution 1, 2, 5, 2, 1.

16. Establish the relation $(\sum xy)^3 = \sum x^3y^3 + 3\sum\sum$ . . .

17. Establish the relation $(\sum xy)^4 = \sum x^4y^4 +$ . . .

18. Show that $(\sum xy)(\sum y) =$ . . .

19. Prove that $x_s^2 + x_t^2 > 2x_s x_t$ and hence $\sum (x_s^2 + x_t^2) > 2\sum x_s x_t$.

20. Show that $(\sum x^4 y)(\sum y) > (\sum x^2 y)^2$.

We have

$(\sum x^4 y)(\sum y) = (x_1^4 y_1 + x_2^4 y_2 + \ldots)(y_1 + y_2 + \ldots) = \sum x^4 y^2 + \sum\sum y_s y_t (x_s^4 + x_t^4)$.

Also $(\sum x^2 y)^2 = \sum x^4 y^2 + 2\sum\sum x_s^2 x_t^2 y_s y_t$.

Therefore $(\sum x^4 y)(\sum y) > (\sum x^2 y)^2$, if $\sum x^4 y^2 + \sum\sum y_s y_t (x_s^4 + x_t^4) > \sum x^4 y^2 + 2\sum\sum x_s^2 x_t^2 y_s y_t$, i. e.
if $\sum\sum y_s y_t (x_s^4 + x_t^4) > 2\sum\sum y_s y_t x_s^2 x_t^2$.

But the sum of the squares of two quantities is always greater than twice their product and hence each term on the left is greater than the corresponding term on the right, thus proving the theorem.

The algebraic discussion may be more easily followed if a summation of only two or three terms is first employed.

21. Prove that $(\sum x^4 y)(\sum x^2 y) > (\sum x^3 y)^2$.

22. Prove that $\beta_2 > \beta_1$.

The Moments and the Equation of the Smoothed Curve. It is shown in Chapter II that a smooth curve is fitted on the basis of principles which are assumed true for the data as a whole. One such principle is that of equality of area, which assumes that the area under the curve is equal in numerical value to the total frequency of the distribution. The principle of equality of moments assumes, in addition to the equality of area and total frequency, that the first, second, third and fourth moments computed directly from the data are respectively equal to the first, second, third and fourth moments computed from the adjusted frequencies.

To illustrate the application of the method of equality of moments let us fit a straight line to the points (2,4), (3,3), (4,7), (5,6). The equation of the required line is $y = mx + b$ where m and b are to be determined. The adjusted y's in terms of m and b are $2m + b$, $3m + b$, $4m + b$, $5m + b$. The equality of the area and the total frequency can be expressed as an equality of moments if the moment of zero order is permitted. This is possible because any number with an exponent zero is equal to unity. Hence

$2^0 \cdot 4 + 3^0 \cdot 3 + 4^0 \cdot 7 + 5^0 \cdot 6 = 4 + 3 + 7 + 6 = 20$.

Also, $2^0(2m + b) + 3^0(3m + b) + 4^0(4m + b) + 5^0(5m + b) = 14m + 4b$.

Hence, on equating the two zero moments, $14m + 4b = 20$.

From the first moment, $2 \cdot 4 + 3 \cdot 3 + 4 \cdot 7 + 5 \cdot 6 = 2(2m + b) + 3(3m + b) + 4(4m + b) + 5(5m + b)$, we have $54m + 14b = 75$.
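The two moment equations for the straight line can also be formed and solved mechanically. The following Python sketch assumes the four illustrative points of the text and a hypothetical helper, fit_line_by_moments, which equates the zero and first moments of the observed and adjusted ordinates and solves the resulting pair of linear equations with exact fractions.

```python
from fractions import Fraction


def fit_line_by_moments(points):
    """Fit y = m*x + b by equating the zero and first moments about x = 0."""
    s0 = len(points)                        # sum of x^0
    s1 = sum(x for x, _ in points)          # sum of x
    s2 = sum(x * x for x, _ in points)      # sum of x^2
    t0 = sum(y for _, y in points)          # zero moment of the observed y's
    t1 = sum(x * y for x, y in points)      # first moment of the observed y's

    # Zero moment:   s1*m + s0*b = t0
    # First moment:  s2*m + s1*b = t1
    det = s1 * s1 - s0 * s2
    m = Fraction(t0 * s1 - s0 * t1, det)
    b = Fraction(s1 * t1 - s2 * t0, det)
    return m, b


POINTS = [(2, 4), (3, 3), (4, 7), (5, 6)]
m, b = fit_line_by_moments(POINTS)
print("m =", m, " b =", b)
```

For a straight line these two conditions are the same pair of equations that the method of least squares would give, so the sketch can be used to check hand computations on other sets of points.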
Solving these two moment equations simultaneously we have $m = 1$, $b = 1.5$.

Therefore $y = x + 1.5$ is the required equation of the straight line fitted to the given points on the basis of the assumption of equality of the zero and first moments respectively.

Exercises.

23. Fit a straight line to the preceding illustrative points on the assumption of equality of the first and second moments respectively. Should the resulting equation agree exactly with the equation found above?

24. Fit a straight line to the points (1,5), (3,8), (4,6), (5,5), (7,10).

25. Fit a parabola, $y = a + bx + cx^2$, to the points of Exercise 24.

CHAPTER XII.

FURTHER THEORY OF CORRELATION.

A Second Concept of Correlation. In Chapter VII two attributes are said to be correlated when there is a tendency for a change in the value of one to be followed by a change in the value of the other. And the ratio of the standard deviation of the means of the arrays to the standard deviation of all the variates was taken as the measure of the degree of correlation between the attributes. A second approach to the matter of correlated variates is as follows.

On the assumption that the mean is the representative of the variates of an array the dependence of y on x is exhibited by the curve of means; that is, by the regression curve. Obviously this curve is a significant measure of the dependence of y on x only insofar as the means are in fact representatives of the variates of the respective arrays. Within this limitation the spread of the variates about the means of the successive arrays is a measure of the extent of dependence of y on x; that is, of the correlation of y with x.

Let $\sigma_{a_y}^2$ denote the mean squared divergence from the regression curve.
Then . . . As is explained in Chapters VIII and IX, this mean squared deviation must be divided by . . .

. . . , a first moment about a horizontal line thru the mean (both y and x are assumed to be measured from their respective means), and hence zero from an obvious extension of the theorem of page 35. Likewise $\sum n_x x = 0$. Hence, from the first moment equation, a is equal to zero. In the second equation we have . . .

The summation $\sum n_x x^2$ has been taken equal to $N\sigma_x^2$ and $\sum n_y y^2$ equal to $N\sigma_y^2$. It seems consistent with this notation to assume $\sum\sum n_{xy}\,yx = N r \sigma_x \sigma_y$, where r is the numerical constant of Chapter IX. On reducing the second moment equation we have

$N r \sigma_x \sigma_y = b_{yx} \cdot N\sigma_x^2$.

Therefore $b_{yx} = r\,\frac{\sigma_y}{\sigma_x}$ and hence $y_x = r\,\frac{\sigma_y}{\sigma_x}\,x$ is the required regression equation.

Exercises.

5. Derive the regression equation $x_y = r\,\frac{\sigma_x}{\sigma_y}\,y$.

6. Prove in detail that $\sum\sum n_{xy}\,y = 0$, where x and y are measured from the mean.

When x and y are measured from the original axes the regression equations become

$y_x - \bar y = r\,\frac{\sigma_y}{\sigma_x}(x - \bar x)$,
$x_y - \bar x = r\,\frac{\sigma_x}{\sigma_y}(y - \bar y)$.

The Relation Between $\eta$ and r. It was shown in Chapter IX that $\eta$ and r have the same numerical value when the regression is truly linear. Hence a lack of agreement in the values of $\eta$ and r is an indication of a divergence from linearity in the regression.

The difference between $\eta$ and r is expressible in terms of the divergence from linearity by the two equations:

$N\sigma_y^2(\eta_y^2 - r^2) = \sum n_x (Y_x - \bar y_x)^2$ and $N\sigma_x^2(\eta_x^2 - r^2) = \sum n_y (X_y - \bar x_y)^2$,

where $Y_x$ and $X_y$ are the regression line means.

To prove the first of these formulas let us add and subtract $\bar y$ for each term in the summation $\sum n_x (Y_x - \bar y_x)^2$. We then have after expansion,

$\sum n_x (Y_x - \bar y_x)^2 = \sum n_x (Y_x - \bar y)^2 - 2\sum n_x (Y_x - \bar y)(\bar y_x - \bar y) + \sum n_x (\bar y_x - \bar y)^2$.

On substituting from the regression equations this expanded form becomes . . .

9.
Show that the same pair of equations will be obtained for the regression lines if the assumed lines are fitted to the individual frequencies instead of to the means of the arrays.

10. Prove that for truly linear regression $\sigma_{m_y} = r\sigma_y$.

11. Show that for truly linear regression $\sigma_{a_y}^2 =$ . . .

. . . and hence

$N\phi^2 = N\sum\sum \frac{n_{xy}^2}{n_x n_y} - 2N + N$,

since $\sum\sum \frac{n_x n_y}{N} = \frac{1}{N}\sum n_x \sum n_y = \frac{1}{N}\sum n_x \cdot N = \sum n_x = N$.

Therefore $\phi^2 = \sum\sum \frac{n_{xy}^2}{n_x n_y} - 1$.

The probable deviation of $\phi$ is discussed at length by Pearson and Blakeman*.

Exercises.

1. Compute the value of $\phi$ for the data of Table VIII.

2. Why not divide the square of the difference for each sub-group by the actual frequency of the sub-group instead of the frequency under the assumption of no correlation?

Properties of $\phi$. In data selected entirely at random; that is, where $n_{xy} = \frac{n_x n_y}{N}$ for all values of x and y, the value of $\phi$ is of course zero. It does not necessarily follow, however, that for absolutely uncorrelated material; that is, for data having $\eta_x = \eta_y = 0$, the value of $\phi$ must be zero.

A moment's consideration will show that the greatest value for $\sum \frac{n_{xy}^2}{n_x n_y}$, taken over the subgroups of any one array, is unity and that this greatest value cannot be attained unless the subgroup of intersection is the only subgroup with non-vanishing frequency in either of the two arrays intersecting in that subgroup. It follows that, if the distribution is not square, the number of arrays giving the maximum value cannot be greater than the number of the longer arrays. Hence in symbols, if r and s are the numbers of arrays of the respective attributes and r = s or r < s, the greatest value for $\phi^2$ is $r - 1$.

*Biometrika. Vol. V, p. 191 et seq.

For illustration, in the table

a b c d
e f g h
i j k l

at least one horizontal array must have more than one non-vanishing subgroup frequency. Let this table be

a 0 0 0
0 f 0 0
0 0 k l

Then $\sum\sum \frac{n_{xy}^2}{n_x n_y} = \frac{a^2}{a \cdot a} + \frac{f^2}{f \cdot f} + \frac{k^2}{k(k + l)} + \frac{l^2}{l(k + l)} = 1 + 1 + \frac{k}{k + l} + \frac{l}{k + l} = 3$.

Exercises.
3. Show that for $\phi = 0$, the means of the x and the y arrays lie on vertical and horizontal lines respectively.

4. Show by actual substitution in the formula $\phi^2 = \sum\sum \frac{n_{xy}^2}{n_x n_y} - 1$ that $\phi = 0$ when $n_{xy} = \frac{n_x n_y}{N}$.

5. Verify the just preceding theory by assigning different combinations of values to the symbols a, b, c, d, e, f, g, h, i, in the distribution:

a b c
d e f
g h i

6. Do the same for the distribution

a b c d e
f g h i j

7. Show that the greatest value of $\phi^2$ for a table of the form of Table VIII is 13.

8. Give an algebraic demonstration for this theory when applied to a general distribution r by s fold.

The greatest disadvantage of $\phi$ as a measure of correlation arises from the fact that its value depends on the number of arrays in the distribution, so that it is almost entirely useless for purposes of comparison. Another disadvantage lies in the fact that, notwithstanding the logical simplicity and directness of the theory underlying the method of contingency, in practice the interpretation of variations in the value of $\phi$ is a matter of much difficulty. For instance, when $\phi$ equals 2.5 what is the significance of an increase of 0.5 in its value? How much greater is the degree of closeness of association in the latter case than in the first? A third objection is that for a large table the labor of computation is heavy.

The first objection above is partially overcome by making use of the coefficient of contingency, $\sqrt{\frac{\phi^2}{1 + \phi^2}}$. This constant is given added prestige by the following relation. It may be shown that for a finely divided distribution of a particular type the coefficient of contingency and the coefficient of correlation are equal in value. Consequently in certain forms of distribution this furnishes a convenient method of obtaining the value of r. However, care must be taken to make sure that the assumptions essential for the validity of this theorem are approximated to with sufficient closeness.
Ordinarily it is better to make use of methods which do not rest on so extensive assumptions.

An approximation to the probable deviation of the coefficient of contingency is to take one and one-third the probable deviation of r.

Exercises.

9. Compute the coefficient of contingency for the data of Table VIII and compare with the value of r already computed.

10. Do the same for the data of Table IX, Chapter VII.

11. By combining arrays in the distribution of Table VII and computing the successive values of the coefficient of contingency show the effect of different widths of classes on the value of this constant.

12. Show that the coefficient of contingency is smaller than the value of r computed by the method of moments.

13. Compare the reliability of the coefficient of contingency for highly and for slightly correlated data.

14. Compare the labor required to compute the value of $\phi$ with that for $\eta$.

In concluding this part of the discussion of the method of contingency it may be stated that when the attributes can be definitely measured there is no practical advantage in computing the value of $\phi$.

Non-Quantitative Characteristics. Because the formula $\sum\sum \frac{n_{xy}^2}{n_x n_y}$ does not contain the deviations x and y and contains only the frequencies of the subgroups it can be applied to distributions in which it is impossible or undesirable to assign numerical values to the deviations; for instance, a distribution of hair and eye color, of degrees of intelligence in drawing and music. Such distributions are said to involve characteristics not quantitatively measured or measurable.

Thus in its fundamental theory the coefficient of contingency applies with equal validity to quantitative and to non-quantitative data.
Moreover, since the number of classes in the case of non-quantitative distributions is ordinarily small the labor of computation is not unduly heavy, and hence the coefficient of contingency is of greater practical importance for this kind of data than for quantitative data. However, it will now be shown that for non-quantitative distributions the correlation ratio is a more convenient and satisfactory measure of correlation than is the coefficient of contingency.

A correlation problem very similar to that arising from non-quantitative data is the finding of the degree of correlation when the measurements of the attributes in quantitative data are classified into very broad classes; to find, for instance, the extent of the tendency for under-height and over-weight to be associated. Further than the effect that so broad classes may have in producing errors in the results obtained by the formula for $\eta$ there is no theoretical objection to the direct application of the theory of the correlation ratio to a distribution obtained by grouping into broad classes.

However, the theory of the correlation ratio does not apply directly to strictly qualitative data and for that reason . . .

. . . are identical and that in other ordinarily occurring cases the values of the two constants are highly correlated.

Exercises.

16. Arrange the data of Table VIII in the following form and compute the values of $\eta$ and $\phi$.

                      Height.
              Under 68    Over 68    Totals
Under 137
Over 137
Totals

The Four-fold and the Nine-fold tables. We shall now derive the formulas for $\eta$ and $\phi$ for a 2 x 2 table and obtain the computation formulas for $\eta$ for a 3 by 3 table. The same method might be employed to derive special formulas for each type of table. In the absence of special formulas the general formula for $\eta$ can be applied directly.

Let us take the four-fold table, . . .

We have . . . Similarly, . . .

Substituting these values in the formula,
$N\sigma_{m_y}^2 = n_1(\bar y_1 - \bar y)^2 + n_2(\bar y_2 - \bar y)^2$, where $\sigma_{m_y}^2$ is the mean squared deviation of the means of the arrays, we have after some detailed reduction, . . .

Also $N\sigma_y^2 =$ . . .

Therefore $\eta =$ . . .

From the formula $\phi^2 = \sum\sum \frac{n_{xy}^2}{n_x n_y} - 1$ it is readily shown by direct computation that $\phi =$ . . .

The equality of $\phi$ and $\eta$ for the fourfold distribution is therefore demonstrated.

For the nine-fold table, we have by a reduction similar to that for the 2 by 2 table, . . . then, on substituting for $\bar y_x$ and $\bar y$, . . .

Therefore, . . . Similarly . . . Hence . . .

Exercises.

17. Compute the value of $\eta$ from the following distribution of the variation in receipts and prices from month to month of live hogs at Union Stock Yards, Chicago, from 1901 to 1914.

Receipts. . . . 50 50 50 50 25 10 7 24 . . .

. . . cannot be equivalent for a table larger than four-fold, because there are two $\eta$'s for each distribution and only one $\phi$. The following theorems may be stated.

1. When $\phi = 0$, the value of $\eta$ for y on x and for x on y are both zero; that is, $\eta_y = \eta_x = 0$.

2. When $\eta_y = \eta_x = 0$, it may ordinarily be expected that $\phi$ will be practically zero but it is not absolutely necessary that such be the case.

3. When only one $\eta$ is zero, it is most likely that $\phi$ will be small in value.

4. When $\phi$ takes the maximum value, $\eta_y = \eta_x = 1$.

5. When $\eta_y = \eta_x = 1$, $\phi$ takes the maximum value.

6. When one $\eta$ only is unity it is most likely that the value of $\phi$ will not differ greatly from the maximum.

7. There is a close correspondence between the values of $\phi$ and the $\eta$'s for data of all degrees of correlation.
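The second of these theorems can be illustrated numerically. The sketch below, in Python, assumes a small fictitious 3 by 3 table with the class values taken as 1, 2, 3 in each direction; the helper names are illustrative only. It computes $\phi^2 = \sum\sum n_{xy}^2/(n_x n_y) - 1$ and the squares of the two correlation ratios, and exhibits a case in which both correlation ratios vanish while $\phi^2$ does not.

```python
from fractions import Fraction

# Fictitious nine-fold table of subgroup frequencies.
TABLE = [
    [2, 4, 2],
    [1, 1, 1],
    [2, 4, 2],
]


def phi_squared(table):
    """Mean square contingency: sum of n_xy^2 / (n_x * n_y), less one."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(
        Fraction(n * n, row_totals[i] * col_totals[j])
        for i, row in enumerate(table)
        for j, n in enumerate(row)
        if n
    )
    return total - 1


def eta_squared(table):
    """Squared correlation ratio of the row attribute on the column attribute.

    Rows carry the values 1, 2, 3, ...; each column is one array.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    grand_mean = Fraction(sum((i + 1) * t for i, t in enumerate(row_totals)), n)
    total_ss = sum(t * (i + 1 - grand_mean) ** 2 for i, t in enumerate(row_totals))
    between_ss = Fraction(0)
    for col in zip(*table):
        size = sum(col)
        if size:
            array_mean = Fraction(sum((i + 1) * v for i, v in enumerate(col)), size)
            between_ss += size * (array_mean - grand_mean) ** 2
    return between_ss / total_ss


transposed = [list(col) for col in zip(*TABLE)]
phi2 = phi_squared(TABLE)
eta2_rows_on_cols = eta_squared(TABLE)
eta2_cols_on_rows = eta_squared(transposed)

print("phi^2 =", phi2)
print("eta^2 (rows on columns) =", eta2_rows_on_cols)
print("eta^2 (columns on rows) =", eta2_cols_on_rows)
```

In this table every array has the same mean, so both correlation ratios are zero, yet the subgroup frequencies are not exactly proportional to the marginal totals and $\phi^2$ is small but not zero.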
Discussion of the Theorems. On substituting the relations $n_{xy} = \frac{n_x n_y}{N}$ in the formula for $\eta_y$ and for $\eta_x$ it follows immediately that $\bar y = \bar y_1 = \bar y_2 = \bar y_3$ and hence that $\eta_y = \eta_x = 0$ for $\phi = 0$.

In regard to theorem 2, it will now be shown that the nine relations, $n_{xy} = \frac{n_x n_y}{N}$, which result for a nine-fold table when $\phi = 0$, can be reduced to four independent relations. That is, if there are four such relations the other five must hold true and the value of $\phi$ is necessarily zero. In other words, the vanishing of $\phi$ imposes four and only four restrictions or conditions on the data of a 3 by 3 table.

For if $n_{11} = \frac{n_{1\cdot}\,n_{\cdot 1}}{N}$, $n_{21} = \frac{n_{2\cdot}\,n_{\cdot 1}}{N}$, $n_{12} = \frac{n_{1\cdot}\,n_{\cdot 2}}{N}$, and $n_{22} = \frac{n_{2\cdot}\,n_{\cdot 2}}{N}$, it follows that $n_{31} = \frac{n_{3\cdot}\,n_{\cdot 1}}{N}$. Let us substitute the equivalents for $n_{11}$ and $n_{21}$ in the equation $n_{31} = n_{\cdot 1} - n_{11} - n_{21}$. This substitution gives

$n_{31} = n_{\cdot 1}\,\frac{N - n_{1\cdot} - n_{2\cdot}}{N} = \frac{n_{3\cdot}\,n_{\cdot 1}}{N}$,

and similarly for the remaining relations.

The vanishing of $\eta_y$ implies the three relations . . .

It can be readily shown that only two of the three relations are independent. That is, if the first two relations hold, the third is necessarily true.

If $\eta_x$ as well as $\eta_y$ vanish the three additional relations

$\frac{n_{21} + 2n_{31}}{n_{\cdot 1}} = \frac{n_{22} + 2n_{32}}{n_{\cdot 2}} = \frac{n_{23} + 2n_{33}}{n_{\cdot 3}} = \frac{n_{2\cdot} + 2n_{3\cdot}}{N}$

are implied. Here again only two of the three additional relations are independent, and of the six relations implied by the vanishing of both $\eta_y$ and $\eta_x$ it is only a matter of algebraic detail to show that only three are independent. That is, the vanishing of both $\eta_x$ and $\eta_y$ imposes one less condition on the data than does the vanishing of $\phi$. And hence it is not necessarily true that $\phi = 0$ when $\eta_x = \eta_y = 0$.

As to the maximum values for these constants, the relations $\eta_y = \eta_x = 1$ require that there be but one non-vanishing frequency in each array of either sense and hence the condition for a maximum value for $\phi$ is satisfied. The converse relations are evidently true.
For only one $\eta$ equal to unity, however, the data might be arranged, for instance, in the form of the table

a 0 0
0 0 0
0 b c

when $\phi$ would not have the maximum value.

If a large number of distributions were made up from the same population and the values of $\eta$ and of $\phi$ computed for each distribution, it would be found that in the long run a large value of $\eta$ was associated with the larger values of $\phi$ and vice versa. But to obtain a formula for the correlation of $\eta$ and $\phi$ is a matter of considerable algebraic detail and the resulting formula is so complicated that it is practically worthless.* For this reason the algebraic discussion of Theorems 2, 3 and 6 is not given in the complete form.

* Compare Blakeman, "The Probable Error of the Coefficient of Contingency", loc. cit.

We have outlined the method of showing that the value of $\eta$ for a non-quantitative distribution has a close connection to the value of $\phi$; that is, that $\eta$ and $\phi$ are highly correlated for such data, and hence the correlation ratio may be used to measure the degree of correlation or association in the data with all the assurance that attaches to the method of contingency. It is of distinct practical advantage to have one coefficient or index of correlation for all kinds of data, and for that reason the coefficient of contingency is not greatly used in practice.

Caution is necessary at one point, however, for data divided into only a few classes do not convey the same amount of information regarding the correlation between the characteristics as does the more detailed material, and hence not the same degree of confidence can be placed in the computed value of any constant derived from the less detailed table.
For this reason comparisons of the values of correlation measures between different forms of distributions must be carefully made and due account taken of the fact that for the small table the results do not warrant the same degree of confidence as do the results from the finely divided table.

... and $n_{22} = n_{2\cdot}\,n_{\cdot 2}/N$, the remaining five relations of the same type hold.
19. Show that if $\eta_y = 0$ the vanishing of $\eta_x$ imposes only one additional condition on the data.
20. Show that if $\eta_y = \eta_x = 0$, the frequencies of the nine-fold table can be expressed in terms of the marginal sums and the frequency of any one sub-group.
21. Show that in the distribution

2 4 2
1 1 1
2 4 2

$\eta_y = \eta_x = 0$ and $\phi \neq 0$.
22. Construct a fictitious table having $\eta_y = 1$ and $\phi$ not having a maximum value.
23. Investigate the relations between $\eta$ and $\phi$ for a 2 x 3 table.

APPENDIX I.

Introduction. The generalized frequency curves of Pearson are so diverse in shape that a curve of this class can be found to fit any ordinary statistical distribution. By the following methods the fitting of a Pearson curve is reduced almost entirely to a matter of routine substitution in formulas, so that the practical statistician can make extended use of the curves without great familiarity with their theory.

This discussion is designed both to present the working methods of the generalized frequency curves and to give the statistician who has a minimum of acquaintance with the higher mathematics some degree of familiarity with the underlying theory. The demonstrations are, for the most part, omitted. Many of the exercises have to do with the omitted theorems and derivations.

In developing the theory of the generalized frequency curves it is logical, as well as practically convenient, to start with the normal curve and consider the general distribution as a modification* of the normal type of distribution.

The Slope Property.
The particular modification which leads to the frequency curves of Pearson is obtained by generalizing the slope condition of the normal curve.** The slope of a curve at a given point is the tangent of the angle which the line touching the curve at that point makes with the X-axis. In the case of the normal curve, the ratio of the slope to the ordinate is negatively equal to the abscissa of the point. This slope property is generalized by taking the ratio equal, not to $-x$, but to
$$\frac{x + a}{b + cx + dx^2},$$
where $a$, $b$, $c$, $d$ are constants. The slope of a curve is ordinarily denoted by the symbol $dy/dx$. In this notation the generalized slope property is expressed by the equation
$$\frac{1}{y}\frac{dy}{dx} = \frac{x + a}{b + cx + dx^2}.$$

* Compare Edgeworth, Jour. Roy. Stat. Soc. Also West, "On the Translated Normal Curve," Ohio Journal of Science, Dec., 1915.
** First extensively treated by Pearson in the article "Skew Variation in Homogeneous Material", Phil. Trans.

The Constants $a$, $b$, $c$, $d$. The statistical significance of each of the constants $a$, $b$, $c$, $d$ can be readily determined.

In Chapter IV it is shown that the slope of a frequency curve is zero at a mode. Since $dy/dx$, that is, the slope, is zero when $x = -a$, the constant $a$ determines the position of the mode. The mode is therefore at a distance $a$ from the mean. As explained in Chapter V, $a$ is thus a measure of the skewness, of the lack of symmetry of the distribution. For a symmetrical distribution $a$ is evidently 0.

When both $c$ and $d$ are zero the generalized slope equation is merely the normal slope equation with $x$ replaced by $(x + a)/b$. This leads to the normal curve
$$y = k\,e^{(x+a)^2/2b},$$
where $k$ is a constant. Comparing this equation with the standard normal equation, we see that $b$ equals $2\sigma^2$ multiplied by a constant.
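The slope property of the normal case can be verified numerically. The sketch below is illustrative only: the function names and the central-difference step are choices made here, and the curve is written as $y = k\,e^{(x+a)^2/2b}$ with $b$ negative, an assumption needed for the curve to be bell-shaped under the sign convention of the slope equation above.

```python
import math


def normal_ordinate(x, a=0.0, b=-1.0, k=1.0):
    """Ordinate y = k * exp((x + a)^2 / (2b)); b is taken negative."""
    return k * math.exp((x + a) ** 2 / (2.0 * b))


def slope_to_ordinate_ratio(curve, x, h=1e-5):
    """Central-difference estimate of (1/y) dy/dx at the abscissa x."""
    return (curve(x + h) - curve(x - h)) / (2.0 * h * curve(x))


if __name__ == "__main__":
    a, b = 0.3, -2.0
    curve = lambda x: normal_ordinate(x, a=a, b=b)
    for x in (-1.5, -0.3, 0.7):
        # The two printed numbers should agree: numerical ratio vs (x + a) / b.
        print(x, slope_to_ordinate_ratio(curve, x), (x + a) / b)
```

The ratio vanishes at $x = -a$, which is the mode, and elsewhere agrees with $(x + a)/b$.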
The degree of symmetry of the curve is indicated by the value of $c$ as well as by the value of $a$. For, when $x$ is positive, the term $cx$ is added in the denominator and when $x$ is negative it is subtracted. This tends to make the frequency curve steeper to the left than to the right of the origin, and hence the curve must extend farther to the right, that is, the curve must be skew.* But it was seen in Chapter V that $\beta_1$ is the fundamental measure of skewness. Therefore both $a$ and $c$ must contain $\beta_1$ as a factor.

* See page 57, Chapter V.

When $x$ is small the constant $d$ has little effect on the slope, but for the extremities of the curve, where $x$ and hence $dx^2$ is large, the slope is reduced by a large value of $d$. It will be seen that $d$ depends largely on $\beta_2$.

The Types of Curves. We may now discuss the distinct types of curves that possess the slope properties of the generalized slope equation. Distinct types of curves result according as the denominator, $b + cx + dx^2$, has two distinct factors, two coincident factors, or has no factors. With two distinct factors the slope equation can be written
$$\frac{1}{y}\frac{dy}{dx} = \frac{x + a}{b + cx + dx^2} = \frac{k\,(x + a)}{(r_1 + x)(r_2 - x)},$$
where $k$ is a constant. By the usual mathematical methods we then have
$$\log y = \frac{k\,(a - r_1)}{r_1 + r_2}\,\log (r_1 + x) \;-\; \frac{k\,(a + r_2)}{r_1 + r_2}\,\log (r_2 - x) + \log y_0, \qquad \text{(A)}$$
where $y_0$ is the constant of integration.

By a simple transformation and rearrangement, this equation can be reduced to the form of Pearson's first type, namely
$$y = y_0\left(1 + \frac{x}{a_1}\right)^{m_1}\left(1 - \frac{x}{a_2}\right)^{m_2}. \qquad \text{Type I.}$$

Exercises.
1. Carry through in detail the necessary transformations to determine the equation of Type I from equation (A).
2. Perform the integrations to obtain the curve of Type I.
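A curve of Type I, $y = y_0(1 + x/a_1)^{m_1}(1 - x/a_2)^{m_2}$, can be checked against the slope equation with two distinct factors. The following Python sketch is a numerical illustration under stated assumptions: the function names are hypothetical, the origin is taken at the mode, and the constants are assumed to satisfy $a_1/m_1 = a_2/m_2$, in which case the ratio of slope to ordinate reduces to $-(m_1 + m_2)\,x / \{(a_1 + x)(a_2 - x)\}$.

```python
import math


def type_one_ordinate(x, a1, a2, m1, m2, y0=1.0):
    """Type I ordinate with origin at the mode; requires -a1 < x < a2."""
    return y0 * (1.0 + x / a1) ** m1 * (1.0 - x / a2) ** m2


def numerical_slope_ratio(curve, x, h=1e-6):
    """Central-difference estimate of d(log y)/dx, i.e. (1/y) dy/dx."""
    return (math.log(curve(x + h)) - math.log(curve(x - h))) / (2.0 * h)


def type_one_slope_ratio(x, a1, a2, m1, m2):
    """Closed form of (1/y) dy/dx, valid when a1/m1 = a2/m2."""
    return -(m1 + m2) * x / ((a1 + x) * (a2 - x))


if __name__ == "__main__":
    a1, m1, m2 = 2.0, 1.5, 3.0
    a2 = a1 * m2 / m1  # enforces a1/m1 = a2/m2
    curve = lambda x: type_one_ordinate(x, a1, a2, m1, m2)
    for x in (-1.0, 0.0, 1.0, 3.0):
        print(x, numerical_slope_ratio(curve, x),
              type_one_slope_ratio(x, a1, a2, m1, m2))
```

The ratio is zero at the origin, confirming that the mode lies there, and the denominator vanishes only at the two ends of the range, $x = -a_1$ and $x = a_2$.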
When $a_1$ and $a_2$ are equal it is readily shown that $m_1 = m_2$ and the equation takes the form of Type II:
$$y = y_0\left(1 - \frac{x^2}{a^2}\right)^{m}. \qquad \text{Type II.}$$

When one root of the denominator $b + cx + dx^2$ is indefinitely large, that is, when $d$ is zero, we have, from the theory of the exponential $e$, the third type:
$$y = y_0\left(1 + \frac{x}{a}\right)^{\gamma a} e^{-\gamma x}. \qquad \text{Type III.}$$
This equation may be looked upon as that of Type I with $a_2$ indefinitely large.

The curves of Type III are especially serviceable because the equations are simple in form and convenient for computation. They are the most elementary skew curves.

By transforming expression (A) in a manner somewhat different from that used to obtain Type I, the form of Pearson's sixth type is readily obtained. It is
$$y = y_0\,(x - a)^{q_2}\,x^{-q_1}. \qquad \text{Type VI.}$$

Exercises.
3. Obtain the equation of Type II by direct integration from the differential equation.
4. Compare Type II with the normal curve.
5. Obtain Type III directly by integration.
6. Obtain Type III from (A).
7. Compare the shape of Type III with that of the normal curve.
8. Obtain the equation of Type VI directly from the differential equation.
10. Is Type VI geometrically distinct from Type I?

When two roots are indefinitely large we have the normal curve
$$y = y_0\,e^{-x^2/2\sigma^2},$$
which is called simply "Normal" in Pearson's scheme of classification.

With two coincident roots, the slope equation becomes
$$\frac{1}{y}\frac{dy}{dx} = \frac{x + a}{d\,(x + r)^2}.$$
This leads to the form
$$y = y_0\,x^{-p}\,e^{-\gamma/x}, \qquad \text{Type V,}$$
which is Pearson's Type V.

Exercises.
11. Derive in detail the equation of Type V.

When the denominator of the slope equation cannot be factored the integration is performed by writing
$$\frac{1}{y}\frac{dy}{dx} = \frac{x + a}{b + cx + dx^2} = \frac{\left(x + \dfrac{c}{2d}\right) + \left(a - \dfrac{c}{2d}\right)}{d\left[\left(x + \dfrac{c}{2d}\right)^2 + \dfrac{b}{d} - \dfrac{c^2}{4d^2}\right]}.$$
This gives
$$y = y_0\left(1 + \frac{x^2}{a^2}\right)^{-m} e^{-\nu \tan^{-1}(x/a)}, \qquad \text{Type IV,}$$
which is the form of Type IV.

Exercises.
12. Derive in detail the equation of Type IV.
13. Derive the equation of Type IV by transformation from the equation of Type I.
14.
Compare the form of the equation of Type IV to that of Type III.

If $\nu$ is zero in the immediately preceding equation we have Pearson's Type VII:
$$y = y_0\left(1 + \frac{x^2}{a^2}\right)^{-m}. \qquad \text{Type VII.}$$

The Intercepts. The intercepts made on the X-axis by the various types of curves can now be examined. The following theorem is fundamental in the theory of the intercepts of Pearson's curves: an incommensurable power of a negative number does not exist.

Let $-N$ denote any negative number, so that
$$(-N)^p = N^p\,(\cos p\pi + \sqrt{-1}\,\sin p\pi),$$
where $\sqrt{-1}$ is the square root of negative unity. Unless $p$ is an integer, $\sin p\pi$ is not zero and hence $(-N)^p$ contains $\sqrt{-1}$, which has no arithmetical value. Hence powers of $-N$ which are not integral do not exist.

In Type I the intercepts are $a_1$ and $a_2$. Since $m_1$ and $m_2$ are not integers, the curve stops at the X-axis and there are no points below that axis. Indeed, there are no negative ordinates on any of the curves.

In Type II the intercepts are of the same length and numerically equal to $a$.

In Type III one intercept is $a$ and the other is indefinitely large.

In the case of the normal curve both intercepts are indefinitely large.

In Types IV and VII there are no intercepts.

In Type V one intercept is at the origin and the other is indefinitely large.

In Type VI both intercepts are positive or both are negative.

Ordinarily the type of curve selected should have intercepts harmonizing with the natural limits of the range of the data. For instance, data necessarily limited in either direction should be smoothed with a curve correspondingly limited. However, nearly all the curves are practically limited in range because the ordinates soon become negligible, so that the matter is not one of great importance; though a somewhat better fit is likely to be obtained with a curve limited in accordance with the data.

Exercises.
15. Of what types is the normal curve a limiting curve?
16.
Distinguish between a curve with indefinitely large intercepts and a curve with imaginary or non-existent intercepts.
17. Show that there are indefinitely more curves of Types I, VI and IV than of Types III, V, II or VII, or of the normal curve.
18. Show how Type I can be said algebraically to include Type IV.
19. Show that Types I and VI are not fundamentally distinct.
20. Show that by taking all combinations of sign into account there are three distinct classes of curve under Type I.
21. Show that there are two sub-classes under Type II according as the exponent $m$ is positive or negative.
22. Show that there are two classes under Type III.
23. Is there more than one general form of curve under Type IV? Under Type V?
24. Discuss the curves of Type VI as to the existence of sub-classes within the Type.
25. What types of these curves have asymptotes?
26. Do all the curves have a mode?
27. Find the points of inflexion for each type.

The Criterion $\kappa$. Since the separation into types depends primarily on the nature of the roots of the quadratic, $b + cx + dx^2$, the discriminant of this quadratic constitutes a criterion of the type of curve which fits the distribution. The values of $a$, $b$, $c$, and $d$ are first determined by the method of moments and then the discriminant is expressed in terms of the computed expressions for $b$, $c$, and $d$. The formula for $\kappa$, the discriminant obtained in this way, is
$$\kappa = \frac{\beta_1\,(\beta_2 + 3)^2}{4\,(2\beta_2 - 3\beta_1 - 6)(4\beta_2 - 3\beta_1)}.$$

This formula for $\kappa$ is derived as follows. The differential equation
$$\frac{1}{y}\frac{dy}{dx} = \frac{x + a}{b + cx + dx^2}$$
may be written $(b + cx + dx^2)\,dy/dx = y\,(x + a)$. Multiplying each side by $x^n$ and integrating, we have
$$\int x^n (b + cx + dx^2)\,\frac{dy}{dx}\,dx = \int y\,(x + a)\,x^n\,dx.$$
On integrating the left side by parts,
$$\left[x^n (b + cx + dx^2)\,y\right] - nb\int x^{n-1} y\,dx - (n+1)\,c\int x^n y\,dx - (n+2)\,d\int x^{n+1} y\,dx = \int x^{n+1} y\,dx + a\int x^n y\,dx.$$
With the usual notation, where $\mu'_n = \int x^n y\,dx$, and if $y$ is very small at the ends of the range, the first expression vanishes and the moment equation connects the three moments $\mu'_{n-1}$, $\mu'_n$, $\mu'_{n+1}$. On rearranging this equation we have
$$a\,\mu'_n + nb\,\mu'_{n-1} + (n+1)\,c\,\mu'_n + (n+2)\,d\,\mu'_{n+1} = -\mu'_{n+1}.$$
Since the moment $\mu'_0 = 1$ and, if the mean is taken as origin, $\mu'_1 = 0$, we have for $n = 0, 1, 2, 3$ respectively the four equations:
$$a + c = 0,$$
$$b + 3d\mu_2 = -\mu_2,$$
$$a\mu_2 + 3c\mu_2 + 4d\mu_3 = -\mu_3,$$
$$a\mu_3 + 3b\mu_2 + 4c\mu_3 + 5d\mu_4 = -\mu_4.$$
On solving this set of equations and substituting in the differential or slope equation, we have
$$\frac{1}{y}\frac{dy}{dx} = -\,\frac{(10\mu_2\mu_4 - 12\mu_3^2 - 18\mu_2^3)\,x + \mu_3(\mu_4 + 3\mu_2^2)}{\mu_2(4\mu_2\mu_4 - 3\mu_3^2) + \mu_3(\mu_4 + 3\mu_2^2)\,x + (2\mu_2\mu_4 - 3\mu_3^2 - 6\mu_2^3)\,x^2}.$$
In terms of $\beta_1$ and $\beta_2$ this becomes
$$\frac{1}{y}\frac{dy}{dx} = -\,\frac{2(5\beta_2 - 6\beta_1 - 9)\,x + \sigma\sqrt{\beta_1}\,(\beta_2 + 3)}{\sigma^2(4\beta_2 - 3\beta_1) + \sigma\sqrt{\beta_1}\,(\beta_2 + 3)\,x + (2\beta_2 - 3\beta_1 - 6)\,x^2},$$
where $\sqrt{\beta_1}$ is taken with the sign of $\mu_3$.

The discriminant of the quadratic denominator is the required criterion, $\kappa$. It is easily shown that
$$\kappa = \frac{c^2}{4bd}.$$
For the quadratic expression $dx^2 + cx + b$ may be written
$$d\left\{x + \frac{c + \sqrt{c^2 - 4bd}}{2d}\right\}\left\{x + \frac{c - \sqrt{c^2 - 4bd}}{2d}\right\}.$$
Hence the character of the two factors depends on the value of the quantity $(c^2 - 4bd)$. When this is zero the two factors are equal; when it is negative there are no factors, etc. Writing $(c^2 - 4bd)$ in the form $(c^2/4bd - 1)\,4bd$ we have, if $\kappa = c^2/4bd$, the following classes of factors according to values of $\kappa$:

If $\kappa$ is negative, $4bd$ is negative, $(c^2 - 4bd)$ is positive, and the factors are unequal.
If $\kappa$ lies between 0 and 1, $(c^2 - 4bd)$ is negative and there are no factors.
If $\kappa = 1$, $(c^2 - 4bd)$ is zero and the factors are equal.
If $\kappa > 1$, $(c^2 - 4bd)$ is again positive and the factors are unequal, etc.

The Value of $\kappa$ and the Types of Curve. The following table gives the types of curves corresponding to the different values of $\kappa$.

$\kappa < 0$, i.e. negative: Type I.
$\kappa = 0$, with $\beta_1 = 0$, $\beta_2 = 3$: Normal Curve.
$\kappa = 0$, with $\beta_1 = 0$, $\beta_2 < 3$: Type II.
$\kappa = 0$, with $\beta_1 = 0$, $\beta_2 > 3$: Type VII.
$\kappa > 0$ and $< 1$: Type IV.
$\kappa = 1$: Type V.
$\kappa > 1$, but not indefinitely large: Type VI.
$\kappa > 1$ and indefinitely large: Type III.

It is to be noted that the type of curve for any given statistical distribution can now be determined by strictly arithmetic methods.

The only restriction on the generality of the theory of the criterion $\kappa$ is that the quantity $x^n (b + cx + dx^2)\,y$ must vanish at both ends of the range. This condition marks the pairs of values of $\beta_1$ and $\beta_2$ for which no curve of the generalized differential equation can be found.
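The arithmetic determination of the type can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a definitive routine: the function names and the tolerance `tol` are hypothetical, and the classification follows Pearson's usual scheme, in which $\kappa = \beta_1(\beta_2+3)^2 / \{4(4\beta_2-3\beta_1)(2\beta_2-3\beta_1-6)\}$, a value of $\kappa$ greater than 1 gives Type VI, and $\beta_1 = 0$ with $\beta_2 > 3$ gives Type VII.

```python
import math


def betas_from_moments(mu2, mu3, mu4):
    """beta_1 = mu3^2 / mu2^3 and beta_2 = mu4 / mu2^2 (moments about the mean)."""
    return mu3 ** 2 / mu2 ** 3, mu4 / mu2 ** 2


def pearson_kappa(beta1, beta2, tol=1e-12):
    """Criterion kappa; returns math.inf when the denominator vanishes."""
    if abs(beta1) < tol:
        return 0.0
    denom = 4.0 * (4.0 * beta2 - 3.0 * beta1) * (2.0 * beta2 - 3.0 * beta1 - 6.0)
    if abs(denom) < tol:
        return math.inf
    return beta1 * (beta2 + 3.0) ** 2 / denom


def pearson_type(beta1, beta2, tol=1e-9):
    """Name of the curve type indicated by beta_1 and beta_2."""
    if abs(beta1) < tol:
        # kappa = 0: the symmetrical curves are separated by beta_2.
        if abs(beta2 - 3.0) < tol:
            return "Normal"
        return "II" if beta2 < 3.0 else "VII"
    kappa = pearson_kappa(beta1, beta2)
    if math.isinf(kappa):
        return "III"
    if kappa < 0.0:
        return "I"
    if abs(kappa - 1.0) < tol:
        return "V"
    return "IV" if kappa < 1.0 else "VI"


if __name__ == "__main__":
    for b1, b2 in [(0.0, 3.0), (0.0, 1.8), (0.5, 2.5), (0.5, 5.0), (4.0, 9.0)]:
        print(b1, b2, pearson_kappa(b1, b2), pearson_type(b1, b2))
```

With computed moments the exact equalities ($\kappa = 0$, $\kappa = 1$, $\kappa$ indefinitely large) are never met precisely, so in practice the tolerance decides whether a transition type or one of the main Types I, IV, VI is reported.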
The limiting values of $\beta_1$ and $\beta_2$ are $\beta_2 > \beta_1$ and $\beta_2 > 15\beta_1/8 + 9/2$ (see Exercises 32 and 33 below).

Exercises.
28. Read the explanation to Tables XXXV-XLVI in "Tables for Statisticians and Biometricians."
29. Derive the formulas $\beta_n\ (\text{odd}) = (n + 1)\,\ldots$, where $\alpha = (2\beta_2 - 3\beta_1 - 6)/(\beta_2 + 3)$.
30. From the computation formulas for Type II, prove that $m$ is negative when $\beta_2 < 1.8$.
31. Prove from the working formulas of Type I that Type I includes three sub-classes according to the signs of $m_1$ and $m_2$. Derive the criterion curve,
$$\frac{\beta_1\,(8\beta_2 - 9\beta_1 - 12)}{4\beta_2 - 3\beta_1} = \frac{(10\beta_2 - 12\beta_1 - 18)^2}{(\beta_2 + 3)^2}.$$
32. Prove that $\beta_2 > \beta_1$.
33. Prove the relation $\beta_2 > 15\beta_1/8 + 9/2$.
34. Show that a large value of $\beta_2$ for the curves derived from the generalized differential equation denotes a comparatively flat-topped curve.
35. Show that for the normal curve with $a = 0$, we have $b = \ldots$

The function $\Gamma(p)$, called the gamma function, occurs in the following formulas. This function is defined by the relation
$$\Gamma(p) = (p - 1)\,\Gamma(p - 1).$$
If $p$ is an integer, $\Gamma(p) = (p - 1)!$. If $p$ is not an integer,
$$\Gamma(p) = (p - 1)(p - 2)\cdots(P + 1)\,P\;\Gamma(P),$$
where $P$ is the remainder after subtracting a sufficient number of 1's to bring $p$ down to between 2 and 1 in value. The values of $\Gamma(P)$ are given in Table XXXI of "Tables."

The probable errors of $\kappa$ as well as of $\beta_1$ and $\beta_2$ are given in "Tables."

The derivation of the following computation formulas, except the moment formulas, is not possible without an extensive acquaintance with the calculus.*

After the constants in the equation are computed, the smoothed frequencies are obtained by computing the areas under the curve and between the bounding ordinates. Thus the frequency of the first class is the area between the ordinates $x = \tfrac{1}{2}$ and $x = 1\tfrac{1}{2}$. Simpson's quadrature formula is ordinarily used for finding the class areas.
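The computation of class areas can be sketched as follows. Simpson's rule for a class of unit width takes one sixth of the sum of the two bounding ordinates and four times the mid-ordinate. The Python below is an illustration only: the function names are hypothetical, classes are assumed to be of unit width and centred on whole-number abscissae, and a normal curve is used merely as a convenient example of a fitted curve.

```python
import math


def class_area(curve, x):
    """Simpson's rule over the unit class bounded by x - 1/2 and x + 1/2."""
    return (curve(x - 0.5) + 4.0 * curve(x) + curve(x + 0.5)) / 6.0


def normal_curve(n_total, sigma):
    """Return the normal ordinate function for total frequency n_total."""
    coeff = n_total / (math.sqrt(2.0 * math.pi) * sigma)
    return lambda x: coeff * math.exp(-x * x / (2.0 * sigma * sigma))


def smoothed_frequencies(curve, first_mid, last_mid):
    """Class areas for classes centred at first_mid, ..., last_mid."""
    return [class_area(curve, float(x)) for x in range(first_mid, last_mid + 1)]


if __name__ == "__main__":
    curve = normal_curve(1000.0, 2.0)
    freqs = smoothed_frequencies(curve, -8, 8)
    for mid, f in zip(range(-8, 9), freqs):
        print(mid, round(f, 1))
    print("total", round(sum(freqs), 1))
```

A useful check on any fitted curve is that the smoothed frequencies add up to very nearly the total frequency $N$.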
According to this formula the area is
$$\tfrac{1}{6}\left(y_{x-\frac{1}{2}} + 4\,y_x + y_{x+\frac{1}{2}}\right),$$
where $y_{x-\frac{1}{2}}$ and $y_{x+\frac{1}{2}}$ are the bounding ordinates and $y_x$ is the mid-ordinate of the class.

* See Elderton, "Frequency Curves and Correlation," C. & E. Layton, for a thorough discussion of the derivations.

Formulas for the Moments.
$$S_2 = d,$$
$$\nu_2 = 2S_3 - d\,(1 + d),$$
$$\nu_3 = 6S_4 - 3\nu_2\,(1 + d) - d\,(1 + d)(2 + d),$$
$$\nu_4 = 24S_5 - 2\nu_3\{2(1 + d) + 1\} - \nu_2\{6(1 + d)(2 + d) - 1\} - d\,(1 + d)(2 + d)(3 + d).$$

Type I. The computation formulas for Type I are as follows. The equation is
$$y = y_0\left(1 + \frac{x}{a_1}\right)^{m_1}\left(1 - \frac{x}{a_2}\right)^{m_2},$$
where $a_1/m_1 = a_2/m_2$. We have
$$r = \frac{6\,(\beta_2 - \beta_1 - 1)}{6 + 3\beta_1 - 2\beta_2},\qquad a_1 + a_2 = \tfrac{1}{2}\,\sigma\sqrt{\beta_1 (r + 2)^2 + 16\,(r + 1)}.$$
$m_2$ and $m_1$ are given by the formulas
$$\tfrac{1}{2}\left\{(r - 2) \pm r\,(r + 2)\sqrt{\frac{\beta_1}{\beta_1 (r + 2)^2 + 16\,(r + 1)}}\right\}.$$
The constant $m_1$ is taken with the negative root when $\mu_3$ is positive and with the positive root when $\mu_3$ is negative. $a_1$ and $a_2$ can be found from the value of $a_1 + a_2$ and the relation $a_1/m_1 = a_2/m_2$.
$$y_0 = \frac{N}{a_1 + a_2}\cdot\frac{m_1^{m_1}\,m_2^{m_2}}{(m_1 + m_2)^{m_1 + m_2}}\cdot\frac{\Gamma(m_1 + m_2 + 2)}{\Gamma(m_1 + 1)\,\Gamma(m_2 + 1)}.$$
The skewness is
$$\text{Sk.} = \tfrac{1}{2}\sqrt{\beta_1}\;\frac{r + 2}{r - 2},\qquad \text{Mode} = \text{mean} - \tfrac{1}{2}\,\frac{\mu_3}{\mu_2}\cdot\frac{r + 2}{r - 2}.$$

Type II. The equation for this type is
$$y = y_0\left(1 - \frac{x^2}{a^2}\right)^{m}.$$
The formulas are
$$m = \frac{5\beta_2 - 9}{2\,(3 - \beta_2)},\qquad a^2 = \frac{2\mu_2\beta_2}{3 - \beta_2},\qquad y_0 = \frac{N}{a}\cdot\frac{\Gamma(2m + 2)}{2^{2m+1}\,\{\Gamma(m + 1)\}^2}.$$

Type III. The equation is
$$y = y_0\left(1 + \frac{x}{a}\right)^{\gamma a} e^{-\gamma x}.$$
The formulas are
$$\gamma = \frac{2\mu_2}{\mu_3},\qquad a = \frac{2\mu_2^2}{\mu_3} - \frac{\mu_3}{2\mu_2},\qquad y_0 = \frac{N}{a}\cdot\frac{p^{\,p+1}}{e^{p}\,\Gamma(p + 1)},\ \text{where } p = \gamma a,$$
$$\text{Mode} = \text{mean} - \frac{\mu_3}{2\mu_2},\qquad \text{Skewness} = \tfrac{1}{2}\sqrt{\beta_1}.$$

Type IV. The equation is
$$y = y_0\left(1 + \frac{x^2}{a^2}\right)^{-m} e^{-\nu\tan^{-1}(x/a)}.$$
The formulas are:
$$r = \frac{6\,(\beta_2 - \beta_1 - 1)}{2\beta_2 - 3\beta_1 - 6},\qquad m = \tfrac{1}{2}(r + 2),\qquad \nu = \frac{r\,(r - 2)\sqrt{\beta_1}}{\sqrt{16\,(r - 1) - \beta_1 (r - 2)^2}},$$
$$a = \sqrt{\frac{\mu_2}{16}}\,\sqrt{16\,(r - 1) - \beta_1 (r - 2)^2},$$
$$y_0 = \frac{N}{a}\sqrt{\frac{r}{2\pi}}\;\frac{e^{\frac{\cos^2\phi}{3r} - \frac{1}{12r} - \phi\nu}}{(\cos\phi)^{r+1}},\ \text{where } \tan\phi = \frac{\nu}{r},$$
$$\text{Origin} = \text{mean} + \frac{\nu a}{r},\qquad \text{Mode} = \text{mean} - \tfrac{1}{2}\,\frac{\mu_3}{\mu_2}\cdot\frac{r - 2}{r + 2}.$$

Type V. The equation is
$$y = y_0\,x^{-p}\,e^{-\gamma/x}.$$
The formulas are:
$$p = 4 + \frac{8 + 4\sqrt{4 + \beta_1}}{\beta_1},\qquad \gamma = (p - 2)\sqrt{\mu_2\,(p - 3)},\ \text{with sign same as that of } \mu_3,$$
$$y_0 = \frac{N\gamma^{\,p-1}}{\Gamma(p - 1)},\qquad \text{Sk.} = \frac{2\sqrt{p - 3}}{p},$$
$$\text{Origin} = \text{mean} - \frac{\gamma}{p - 2},\qquad \text{Mode} = \text{mean} - \frac{2\gamma}{p\,(p - 2)}.$$

Type VI. The equation is
$$y = y_0\,(x - a)^{q_2}\,x^{-q_1}.$$
The formulas are:
$$r = \frac{6\,(\beta_2 - \beta_1 - 1)}{6 + 3\beta_1 - 2\beta_2},\qquad a = \tfrac{1}{2}\,\sigma\sqrt{\beta_1 (r + 2)^2 + 16\,(r + 1)};$$
$q_2$ and $-q_1$ are given by
$$\frac{r - 2}{2} \pm \frac{r\,(r + 2)}{2}\sqrt{\frac{\beta_1}{\beta_1 (r + 2)^2 + 16\,(r + 1)}},$$
$$y_0 = \frac{N a^{\,q_1 - q_2 - 1}\,\Gamma(q_1)}{\Gamma(q_1 - q_2 - 1)\,\Gamma(q_2 + 1)},$$
$$\text{Origin} = \text{mean} - \frac{a\,(q_1 - 1)}{q_1 - q_2 - 2},\qquad \text{Mode} = \text{mean} - \tfrac{1}{2}\,\frac{\mu_3}{\mu_2}\cdot\frac{r + 2}{r - 2}.$$

Type VII. The equation is
$$y = y_0\left(1 + \frac{x^2}{a^2}\right)^{-m}.$$
The formulas are:
$$m = \frac{5\beta_2 - 9}{2\,(\beta_2 - 3)},\qquad a^2 = \frac{2\mu_2\beta_2}{\beta_2 - 3},\qquad y_0 = \frac{N}{a\sqrt{\pi}}\cdot\frac{\Gamma(m)}{\Gamma(m - \tfrac{1}{2})}.$$

Normal Curve. The equation, as was proved in Chapter VI, is
$$y = \frac{N}{\sqrt{2\pi}\,\sigma}\,e^{-x^2/2\sigma^2},$$
and the curve was discussed in that chapter.

APPENDIX II

BIBLIOGRAPHY

BLAKEMAN, J.
"On the Tests for Linearity of Regression in Frequency Distributions", Biometrika, Vol. IV, pp. 332 et seq.

BLAKEMAN, J. and PEARSON, K.
"On the Probable Error of Mean Square Contingency", Biometrika, Vol. V, pp. 191 et seq.

BOWLEY, A. L.
"Measurements of Groups and Series", C. and E. Layton, 1903.
"Relation between the Accuracy of an Average and that of its Constituent Parts", Jour. Roy. Stat. Soc., Dec. 1897.
"The Measurement of the Accuracy of an Average", Jour. Roy. Stat. Soc., Dec. 1911.

BRAVAIS, A.
"Analyse mathématique sur les probabilités des erreurs de situation d'un point", Mémoires présentés par divers savants à l'Académie Royale des Sciences de l'Institut de France, sciences mathématiques et physiques, IIe série, t. IX, 1846, p. 255.

BROWN, GREENWOOD and WOOD.
"A Study of Index Correlation", Jour. Roy. Stat. Soc., Feb. 1914.

CAVE, BEATRICE and PEARSON, K.
"Numerical Illustrations of the Variate-Difference Correlation Method", Biometrika, Vol. X, pp. 340 et seq.

EDGEWORTH, F. Y.
"On the Method of Least Squares", Phil. Mag., Vol. XVI, Ser. 5, 1883, pp. 360 et seq.
"On Theory of Errors of Observation and the First Principles of Statistics", Camb. Phil. Trans., Vol. XIV, pp. 138 et seq.
"Problems in Probability", Phil. Mag., Vol. XXII, Ser. 5, 1886, pp. 374 et seq.
"On a New Method of reducing Observations relating to several Quantities", Phil. Mag., Vol. XXIV, Ser. 5, 1887, pp. 222 et seq. and Vol. XXV, 1888, pp. 184 et seq.
"On Correlated Averages", Phil. Mag., Vol. XXXIV, Ser. 5, 1892, pp. 190 et seq.
"The Asymmetrical Probability Curve", Phil. Mag., Vol. XLI, 1896, pp. 90 et seq.
"Representation of Statistics by Mathematical Formulas", Jour. Roy. Stat. Soc., Dec. 1898; Sept. 1899; June 1899; Mar. 1899; Mar. 1900.
"The Law of Error", Camb. Phil. Trans., Vol. XX, 1905, pp. 36-65 and 113-141.
"On the Generalized Law of Error of Great Numbers", Jour. Roy. Stat. Soc., Sept. 1906.
"On the Representation of Statistical Frequencies by a Series", Jour. Roy. Stat. Soc., Mar. 1907.
"On the Representation of Statistics by Analytical Geometry", Jour. Roy. Stat. Soc., 1914; Feb., pp. 300-312; Mar., pp. 415-432; May, pp. 653-671; June, pp. 724-749; July, pp. 838-852.
Article on "Probability" in the Encyclopaedia Britannica, Eleventh Edition.

EDGEWORTH, F. Y. and BOWLEY, A. L.
"Methods of Representing Statistics of Wages and other Groups not Fulfilling the Normal Law of Error", Jour. Roy. Stat. Soc., June 1902.

ELDERTON, W. PALIN.
"Frequency Curves and Correlation", C. and E. Layton, London, 1906.

ELLIS, LESLIE.
"The Method of Least Squares", Camb. Phil. Trans., Vol. VIII, pp. 1 et seq.

FISHER, R. A.
"On an Absolute Criterion for Fitting Frequency Curves", Messenger, Vol. XLI, pp. 155-160.

GALTON, FRANCIS.
"Family Likeness in Stature", Proc. Roy. Soc., Vol. XL, 1886, pp. 42 et seq.
"Family Likeness in Eye-Color", Proc. Roy. Soc., Vol. XL, 1886, pp. 402 et seq.
"Correlations and their Measurement", Proc. Roy. Soc., Vol. XLV, 1888, pp. 135 et seq.
"The most Suitable Proportions between First and Second Prizes", Biometrika, Vol. I, pp. 385 et seq.

HERON, DAVID.
"On the Probable Error of a Partial Coefficient", Biometrika, Vol. VII, pp. 411 et seq.
"The Danger of Certain Formulae Suggested as Substitutes for the Correlation Coefficient", Biometrika, Vol. VIII, pp. 109 et seq.

HOOKER, R. H.
"Correlation of the Marriage Rate with Trade", Jour. Roy. Stat. Soc., Sept. 1901.
"Correlation of the Weather and Crops", Jour. Roy. Stat. Soc., Mar. 1907.

ISSERLIS, L.
"On the Partial Correlation Ratio", Biometrika, Vol. X, pp. 391 et seq., also Vol. XI.
"The Application of Solid Hypergeometrical Series to Frequency Distribution in Space", Phil. Mag., Vol. XXVIII, Ser. 6, 1914, pp. 379 et seq.

KEYNES, J. M.
"Principal Averages and the Laws of Error which lead to them", Jour. Roy. Stat. Soc., Feb. 1911.

NIXON, J. W.
"An Experimental Test of the Normal Law of Error", Jour. Roy. Stat. Soc., June 1913.

PEARSON, KARL.
"Mathematical Contributions to the Theory of Evolution":
I. "On the Dissection of Asymmetrical Frequency Curves", Phil. Trans., 1894, Vol. CLXXXV, A, part I, pp. 187 et seq.
II. "Skew Variations in Homogeneous Material", Phil. Trans., 1895, Vol. CLXXXVI, A, pp. 343 et seq.
III. "Regression, Heredity and Panmixia", Phil. Trans., 1896, Vol. CLXXXVII, A, pp. 253 et seq.
IV. "On Probable Errors of Frequency Constants and on the Influence of Random Selection on Variation and Correlation", Phil. Trans., 1898, Vol. CXCI, A, pp. 229 et seq. (In Collaboration with L. N. G. Filon.)
V. "On the Reconstruction of the Stature of Prehistoric Races", Phil. Trans., 1899, Vol. CXCII, A, pp. 169 et seq.
VI. "General Selection", Phil. Trans., 1899, Vol. CXCII, A, pp. 257 et seq.
VII. "On the Correlation of Characteristics not Quantitatively Measurable", Phil. Trans., 1900, Vol. CXCV, A, pp. 1 et seq.
VIII. "On the Inheritance of Characters not Capable of Exact Quantitative Measurements", Phil. Trans., 1901, Vol. CXCV, A, pp. 79 et seq.
IX. "On the Principles of Homotyposis and its Relation to Heredity, to the Variability of the Individual and to that of the Race", Phil. Trans., 1901, Vol. CXCVII, A, pp. 285 et seq.
X. "Supplement to a Memoir on Skew Variation", Phil. Trans., 1901, Vol. CXCVII, A, pp. 445 et seq.
XI. "On the Influence of Natural Selection on the Variability and Correlation of Organs", Phil. Trans., Vol. CC, A, 1903, pp. 1 et seq.
XII. "On a Generalized Theory of Alternative Inheritance with Special Reference to Mendel's Law", Phil. Trans., 1904, Vol. CCIII, A, pp. 53 et seq.
XIII. "On the Theory of Contingency and its Relation to Association and Normal Correlation", Drapers' Co. Res. Mem., Biometric Series I, Dulau & Co., London, 1904.
XIV. "On the General Theory of Skew Correlation and Non-Linear Regression", Drapers' Co. Res. Mem., Dulau & Co., 1905.
XV. "A Mathematical Theory of Random Migration", Drapers' Co. Res. Mem., Biometric Series II, 1906. (In Collaboration with Blakeman.)
XVI. "On Further Methods of Determining Correlation", Drapers' Co. Res. Mem., Biometric Series IV, Dulau & Co., London, 1907.
XVIII. "On a Novel Method of Regarding Association, etc.", Drapers' Co. Res. Mem., Biometric Series VII, 1912.
"On a Form of Spurious Correlation due to Indices", Proc. Roy. Soc., Vol. LX, 1897, pp. 489 et seq.
"On a Criterion that a given System of Deviations from the Probable in the Case of Correlated System of Variables is such that it can be reasonably supposed to have arisen from Random Sampling", Phil. Mag., Ser. 5, Vol. L, 1900, pp. 157 et seq.
"On Lines and Planes of Closest Fit to Systems of Points in Space", Phil. Mag., Ser. 6, Vol. II, 1901, pp. 559 et seq.
"On the Systematic Fitting of Curves to Observations and Measurements", Biometrika, Vol. I, pp. 265 et seq. and Biometrika, Vol. II, pp. 1 et seq.
"On the Probable Errors of Frequency Constants", Biometrika, Vol. II, pp. 273 et seq.; also Vol. IX, pp. 1 et seq.
"Elementary Proof of Sheppard's Formulae, etc.", Biometrika, Vol. III, pp. 308 et seq.
"On the Generalized Probable Error in Multiple Normal Correlation", Biometrika, Vol. VI, 1908, pp. 59 et seq. With Alice Lee.
"On a New Method of Determining Correlation between a Measured Character A and a Character B of which only the Percentage of Cases wherein B exceeds (or falls short of) a given Intensity is recorded for each Grade of A", Biometrika, Vol. VII, 1909, pp. 96 et seq.
"On a New Method of Determining Correlation when one Variable is given by Alternative and the other by Multiple Categories", Biometrika, Vol. VII, 1910, pp. 248 et seq.
"On a Correction to be made to the Correlation Ratio η", Biometrika, Vol. VIII, pp. 254 et seq.
"On the Probable Error of a Coefficient of Correlation as found from a Fourfold Table", Biometrika, Vol. IX, pp. 22 et seq.
"On the Measurement of the Influence of 'Broad Categories' on Correlation", Biometrika, Vol. IX, pp. 166 et seq.

PEARSON, K. (Editor).
"Tables for Statisticians and Biometricians", Cambridge University Press, 1914.

PEARSON, K. and HERON, DAVID.
"On Theories of Association", Biometrika, Vol. IX, pp. 158 et seq.

PERSONS, WARREN.
"The Correlation of Economic Statistics", Amer. Stat. Assoc., Vol. XII, Dec. 1910.

SHEPPARD, W. F.
"On Application of the Theory of Error to Cases of Normal Distribution and Normal Correlation", Phil. Trans., 1899, Vol. CXCII, A, pp. 101 et seq.
"On the Calculation of the most probable Values of the Frequency Constants for Data arranged according to equi-distant Divisions of a Scale", Proc. Lon. Math. Soc., Vol. XXIX, pp. 353-380.
"On the Use of Auxiliary Curves in Statistics of Continuous Variates", Jour. Roy. Stat. Soc., Sept. 1900.

SNOW, E. C.
"The Application of the Method of Multiple Correlation to the Estimate of Post Censal Population", Jour. Roy. Stat. Soc., May 1911.

SPEARMAN, C.
"The Proof and Measurement of Association between Two Things", Amer. Jour. of Psychology, Vol. XV, 1904, pp. 88 et seq.
"Demonstration of Formulae for True Measurement of Correlation", Amer. Jour. of Psych., Vol. XVIII, 1907, pp. 161 et seq.
"A Foot-rule for Measuring Correlation", Brit. Jour. of Psych., Vol. II, 1906, pp. 89 et seq.; also Vol. II, part v, pp. 107-108.
"Correlation calculated from Faulty Data", Brit. Jour. of Psych., Vol. III, 1910, pp. 271 et seq.

"STUDENT".
"The Elimination of Spurious Correlation due to Position in Time or Space", Biometrika, Vol. X, pp. 179 et seq.

YULE, G. U.
"On the Significance of Bravais' Formulae for Regression, etc., in the case of Skew Correlation", Proc. Roy. Soc., Vol. LX, 1897, pp. 477 et seq.
"On the Association of Attributes in Statistics", Phil. Trans., 1900, Vol. CXCIV, A, pp. 257 et seq.
"On the Theory of Consistence of Logical Class Frequencies and its Geometrical Representations", Phil. Trans., 1901, Vol. CXCVII, A, pp. 91 et seq.
"On the Theory of Correlation for any Number of Variables treated by a New System of Notation", Proc. Roy. Soc., Ser. A, Vol. LXXIX, 1907, pp. 182 et seq.
"The Application of the Methods of Correlation to Social Economic Statistics", Jour. Roy. Stat. Soc., Dec. 1909.
"On Interpretation of Correlation between Indices or Ratios", Jour. Roy. Stat. Soc., June 1910.
"On the Methods of Measuring Association between Two Attributes", Jour. Roy. Stat. Soc., May 1912.