A FAST ALGORITHM FOR AUTOMATIC CLASSIFICATION

R. T. DATTOLA: Department of Computer Science, Cornell University, Ithaca, New York

An economical classification process of order n log n (for n elements), which does not employ n-square procedures. Convergence proofs are given and possible information retrieval applications are discussed.

Many methods exist for ordering or classifying the elements of a file. The elements are usually clustered into groups based on the similarities of the attributes of the elements. In information retrieval, the elements are frequently documents, and the attributes are words or concepts characterizing the documents. Classification of document files may be divided into two basic categories: 1) an a priori classification already exists, and each document is placed into the cluster whose centroid is most similar to that document; 2) no a priori classification is specified, and clusters are formed only on the basis of similarities among documents. Classification schemes that fall into the first class are very common and often involve manual methods. For example, new acquisitions of a library are classified by placing them into the clusters of a standard a priori classification. Problems of the second type are usually more difficult to handle, and automatic or semi-automatic methods are often used. Methods of this type are widely used in statistical programs, but the number of elements in the file is limited to several hundred, or at most, a few thousand items. In information retrieval applications, the number of elements may approach several hundred thousand or even a million documents, as in the case of a large library. In the present study, a method is described which is suitable for classification of very large document collections.

THE N² PROBLEM

Current methods of automatic document classification usually require the calculation of a similarity matrix. This matrix specifies the correlation, or similarity, between every pair of documents in the collection. Thus, if the collection contains N documents, N² computations are required for calculation of the similarity matrix; however, the similarity matrix is often symmetric, so the number of computations is reduced to N²/2. This immediately poses two serious problems: the storage space necessary to store the matrix increases as the square of the number of documents, and the time required to calculate the matrix also increases quadratically. Fortunately, document-document similarity matrices are normally only about ten percent dense, and only the non-zero elements need be stored (1). However, as N increases, auxiliary storage must eventually be used, and although this solves the space problem, it also magnifies the time problem.

To illustrate the magnitude of this problem, suppose that it takes one hour of computer time to classify a one-thousand document collection. Then for N = 10⁴, the time is approximately one hundred hours, and for N = 10⁶, the time needed is about 120 years! The classification scheme described in this paper is an adaptation of the one proposed by Doyle, and the time required is of the order of N log N (2). For example, assuming the logarithm has base 10, and the time required for a one-thousand document collection is again one hour, then for N = 10⁴ the time is 13 hours, and for N = 10⁶ the time required is about 83 days.
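These figures follow from scaling the hypothetical one-hour baseline; a minimal sketch of the arithmetic (Python, assuming one hour for N = 1,000 and a base-10 logarithm as in the text):

```python
# Scaling of classification time from a baseline of 1 hour for N = 1,000 documents,
# comparing an N^2 method with an N log10(N) method (the figures quoted above).

import math

BASE_N = 1_000          # collection size for the one-hour baseline (hypothetical)
BASE_HOURS = 1.0        # assumed running time at BASE_N

def hours_n_squared(n):
    """Time for an N^2 method, scaled from the baseline."""
    return BASE_HOURS * (n / BASE_N) ** 2

def hours_n_log_n(n):
    """Time for an N log10(N) method, scaled from the baseline."""
    base_work = BASE_N * math.log10(BASE_N)
    return BASE_HOURS * (n * math.log10(n)) / base_work

for n in (10**3, 10**4, 10**6):
    print(f"N = {n:>9,}:  N^2 ~ {hours_n_squared(n):>12,.0f} h,  "
          f"N log N ~ {hours_n_log_n(n):>8,.1f} h")
# N = 10^4 gives 100 h versus about 13 h; N = 10^6 gives 10^6 h (on the order of
# the "120 years" quoted above) versus about 2,000 h, i.e. roughly 83 days.
```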
DOYLE'S ALGORITHM

The N² problem is avoided in this classification scheme, because a similarity matrix is never computed. Assume the document set is arbitrarily partitioned into m clusters, where Sj is the set of documents in cluster j. Associated with each set Sj are a corresponding concept vector Cj and frequency vector Fj. The concept vector consists of all the concepts occurring in the documents of Sj, and the frequency vector specifies the number of documents in Sj in which each concept occurs. Every concept in Cj is assigned a rank according to its frequency; i.e., concepts with the highest frequency have a rank of 1, concepts with the next highest frequency receive a rank of 2, etc. Given an integer b (base value), every concept in Cj is assigned a rank value equal to the base value minus the rank of that concept. The vector of rank values is called the profile Pj of the set Sj. Tables 1 and 2 illustrate the concept and frequency vectors, and the corresponding profiles, for a sample document collection (base value = 6).

Table 1. Concept Vectors

  d1    d2    d3    d4    d5    d6    d7
  c1    c1    c1    c1    c1    c3    c6
  c2    c2    c7    c2    c5          c5
        c4    c5    c3
              c11

Table 2. Initial Clusters, Profiles, and Frequencies

  S1    C1    F1    P1      S2    C2    F2    P2      S3    C3    F3    P3
  d1    c1    3     5       d2    c1    2     5       d6    c3    1     5
  d3    c2    1     3       d4    c2    2     5       d7    c6    1     5
  d5    c11   1     3             c3    1     4             c5    1     5
        c7    1     3             c4    1     4
        c5    2     4

Starting from a partition of the document set into m clusters, the profiles are generated as described. Every document di in the document space is now scored against each of the m profiles by a scoring function g, where g(di,Pj) = the sum of the rank values of all the concepts from di which occur in Cj. Tables 3 and 4 show the results of scoring the documents of Table 1 against the profiles of Table 2 (cut-off = 10).

Table 3. Document Scoring

  Document    Profile of Highest Score    Score
  d1          2                           10
  d2          2                           14
  d3          1                           15
  d4          2                           14
  d5          1                            9
  d6          3                            5
  d7          3                           10

Table 4. Clusters Resulting from Document Scoring

  S1'    S2'    S3'    L
  d3     d1     d7     d5
         d2            d6
         d4

Given a cut-off value T, a new partition of the document set into m+1 clusters is made by the following formula:

  Sj' = [ di | g(di,Pj) ≥ g(di,Pk) for k = 1, ..., m, and g(di,Pj) ≥ T ].

Thus, Sj' consists of all the documents that score highest against profile Pj, provided that the score is at least as great as T. In cases where a document scores highest against two or more profiles, say Pr1, ..., Prn, the following tie-breaking rule is used: if di ∈ Sj and, a) j = rk for some k, 1 ≤ k ≤ n, then di is assigned to (Srk)'; b) j ≠ rk for 1 ≤ k ≤ n, then di is arbitrarily assigned to (Sr1)'. Those documents which do not fall into any of the m clusters Sj' are called loose documents, and they are assigned to a special class L. The process is now repeated after replacing each Pj by Pj', the profile computed from the new set Sj'. The iteration continues until the termination condition is satisfied, which states that Pj' = Pj for j = 1, ..., m, i.e., the profiles are unchanged after two consecutive iterations.
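For concreteness, the profile construction and the scoring function g described above might be rendered as in the following sketch (Python). The function names and data layout are illustrative assumptions, not the author's program; the small demonstration uses the initial cluster S1 = {d1, d3, d5} of Tables 1 and 2.

```python
from collections import Counter

def build_profile(cluster_docs, base):
    """Build the profile of a cluster: every concept occurring in the cluster's
    documents receives a rank value equal to (base value) minus its rank, where
    concepts of the highest document frequency have rank 1, the next highest
    frequency rank 2, and so on.  (A rank value that would fall to zero or
    below is redefined to be 1, as noted later in the paper.)"""
    freq = Counter()                       # frequency vector of the cluster
    for doc in cluster_docs:
        freq.update(set(doc))
    distinct = sorted(set(freq.values()), reverse=True)
    rank_of_freq = {f: r + 1 for r, f in enumerate(distinct)}
    return {c: max(base - rank_of_freq[f], 1) for c, f in freq.items()}

def g(doc, profile):
    """Scoring function g: the sum of the rank values of the document's
    concepts that occur in the profile's concept vector."""
    return sum(profile.get(c, 0) for c in set(doc))

# The initial cluster S1 = {d1, d3, d5} of Tables 1 and 2 (base value = 6):
S1_docs = [["c1", "c2"],                 # d1
           ["c1", "c7", "c5", "c11"],    # d3
           ["c1", "c5"]]                 # d5
P1 = build_profile(S1_docs, base=6)
# P1 maps c1 -> 5, c5 -> 4, and c2, c7, c11 -> 3, as in Table 2.
print(g(["c1", "c5"], P1))               # d5 scores 9 against P1 (cf. Table 3)
```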
SATISFACTION OF TERMINATION CONDITION

A) Non-Convergence of Doyle's Algorithm

Doyle's algorithm as described is not guaranteed to terminate. To illustrate this, consider the following document collection:

  d1: 1, 2, 3, 4, 5, 11
  d2: 1, 5, 6, 11, 12
  d3: 3, 7, 8, 9, 12
  d4: 3, 4, 7, 8
  d5: 1, 5, 6, 7
  d6: 2, 3, 4, 9
  d7: 2, 3, 4, 7, 8, 9, 10
  d8: 2, 4, 5, 6, 7, 8, 10
  d9: 1, 2, 4, 5, 6, 7, 10, 13

Let S1 = [d1-d6], S2 = [d7-d9], and let P1 = profile of S1 and P2 = profile of S2. The two profiles are as follows (base value = 7):

            d1-d6                            d7-d9
  Concept  Frequency  Profile     Concept  Frequency  Profile
     1         3         5           1         1         4
     2         2         4           2         3         6
     3         4         6           3         1         4
     4         3         5           4         3         6
     5         3         5           5         2         5
     6         2         4           6         2         5
     7         3         5           7         3         6
     8         2         4           8         2         5
     9         2         4           9         1         4
    10         0                    10         3         6
    11         2         4          11         0
    12         2         4          12         0
    13         0                    13         1         4

Now assume that T = 0, and partition the document set by the formulae S1' = [di | g(di,P1) ≥ g(di,P2)] and S2' = [di | g(di,P2) ≥ g(di,P1)]. The results are summarized in the following table:

          g(di,P1)    g(di,P2)
  d1         29          25
  d2         22          14
  d3         23          19
  d4         20          21
  d5         19          20
  d6         19          20
  d7         28          37
  d8         27          39
  d9         28          42

Therefore, S1' = [d1-d3] and S2' = [d4-d9]. According to Doyle's algorithm, P1 is replaced by P1' and P2 by P2'. The new profiles are:

            d1-d3                            d4-d9
  Concept  Frequency  Profile     Concept  Frequency  Profile
     1         2         6           1         2         3
     2         1         5           2         4         5
     3         2         6           3         3         4
     4         1         5           4         5         6
     5         2         6           5         3         4
     6         1         5           6         3         4
     7         1         5           7         5         6
     8         1         5           8         3         4
     9         1         5           9         2         3
    10         0                    10         3         4
    11         2         6          11         0
    12         2         6          12         0
    13         0                    13         1         2

Now the document set is again partitioned and the results are:

          g(di,P1)    g(di,P2)
  d1         34          22
  d2         29          11
  d3         27          17
  d4         21          20
  d5         22          17
  d6         21          18
  d7         31          32
  d8         31          33
  d9         32          34

Therefore, S1' = [d1-d6] and S2' = [d7-d9]. These are the original sets, so the algorithm will never terminate for this example.

B) Termination of Modified Algorithm

Although Doyle's algorithm is not guaranteed to terminate, Needham proved that similar types of iterative methods are guaranteed to terminate in a finite number of steps (3). A small change in Doyle's method produces an algorithm that is guaranteed to terminate. The modification occurs after the calculation of the Sj'. Instead of automatically replacing the old Pj by Pj', the following condition must also be satisfied:

  Σ g(di,Pj')  >  Σ g(di,Pj),    both sums taken over di in Sj'.

If the above condition is not satisfied, Pj is left unchanged. Before proving that this new algorithm is guaranteed to terminate, it is desirable first to make the algorithm more general by allowing overlap between the clusters. The following theorem proves the termination of a method which allows overlapping clusters.

Theorem: Let the subscript n designate the nth iteration. Let D represent the document space and let P0,1, ..., P0,m represent m initial profiles corresponding to an arbitrary distribution S0,1, ..., S0,m of documents in D. Given a cut-off value T, the nth iteration is defined as follows:

  1) Generate the sets Sn,1, ..., Sn,m and Ln by
       Sn,j = [ di | g(di,Pn-1,j) ≥ T ],
       Ln = the loose documents, i.e., those contained in no Sn,j.

  2) Let P'n,j be the profile computed from Sn,j, and let
       Pn,j = P'n,j     if  Σ g(di,P'n,j) > Σ g(di,Pn-1,j), both sums over di in Sn,j;
       Pn,j = Pn-1,j    otherwise.

This algorithm is guaranteed to terminate in a finite number of iterations, where termination occurs when Pn,j = Pn-1,j for all j.
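One iteration of the algorithm defined in the theorem, including the conditional profile replacement of step 2, might be sketched as follows (Python). The helper functions repeat those of the earlier sketch so that the fragment is self-contained; the names and data layout are illustrative assumptions, not the author's implementation.

```python
from collections import Counter

def build_profile(docs, base):
    """Profile of a set of documents: rank value = base minus frequency rank."""
    freq = Counter()
    for d in docs:
        freq.update(set(d))
    ranks = {f: r + 1 for r, f in enumerate(sorted(set(freq.values()), reverse=True))}
    return {c: max(base - ranks[f], 1) for c, f in freq.items()}

def g(doc, profile):
    """Sum of the rank values of the document's concepts found in the profile."""
    return sum(profile.get(c, 0) for c in set(doc))

def iterate_once(docs, profiles, T, base):
    """One iteration of the algorithm of the theorem.
    Step 1: S_j = { d | g(d, P_j) >= T }  (clusters may overlap); documents in
            no cluster are loose.
    Step 2: the profile computed from S_j replaces P_j only if the summed score
            of S_j against the new profile exceeds its summed score against the
            old one; otherwise the old profile is kept."""
    clusters = [[d for d in docs if g(d, p) >= T] for p in profiles]
    loose = [d for d in docs if all(g(d, p) < T for p in profiles)]
    new_profiles = []
    for members, old in zip(clusters, profiles):
        candidate = build_profile(members, base) if members else old
        if sum(g(d, candidate) for d in members) > sum(g(d, old) for d in members):
            new_profiles.append(candidate)
        else:
            new_profiles.append(old)     # condition not satisfied: profile unchanged
    return clusters, loose, new_profiles
```

Repeating iterate_once until the returned profiles equal the previous ones realizes the termination condition Pn,j = Pn-1,j for all j.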
Proof: Extend the document space D to a new document space D# containing m distinguishable copies of every document in D. Also, add the condition that Sn,j can never contain more than one copy of each document. Clearly, any Sn,j defined on D# in this manner can also be represented on D as defined in the theorem. Conversely, any Sn,j defined on D can also be represented on D# as defined above. Thus, it suffices to prove the theorem on D# under the added condition.

Define a function Fn, which will be shown to be monotone increasing in n, by the following:

  Fn = Fn,1 + ... + Fn,m + T·Zn,  where
  Fn,j = Σ g(di,Pn-1,j)   (sum over di in Sn,j)  and
  Zn = number of documents in Ln.

After step 2 of the iteration, Fn is replaced by Fn', where

  (Fn,j)' = Σ g(di,Pn,j)   (sum over di in Sn,j).

If for any j, Pn,j ≠ Pn-1,j, then (Fn,j)' > Fn,j (this statement is not necessarily true in Doyle's algorithm), and therefore Fn' > Fn. If termination occurs, i.e., Pn,j = Pn-1,j for all j, then Fn' = Fn. For the (n+1)th iteration,

  Fn+1 = Fn+1,1 + ... + Fn+1,m + T·Zn+1,  where
  Fn+1,j = Σ g(di,Pn,j)   (sum over di in Sn+1,j).

Consider the relation between the contribution of di to Fn' and to Fn+1, and note that each di (where copies of a document are distinct) contributes once and only once to both Fn' and Fn+1. This relation is summarized in the following table:

  Document di                                        Contribution to Fn+1 relative to Fn'
  a) was assigned to Sn,j and now
     1) to Sn+1,j (di did not change clusters)        equal: g(di,Pn,j) in both cases
     2) to Sn+1,k, k ≠ j (di did change clusters)     greater: g(di,Pn,k) ≥ T > g(di,Pn,j)
     3) to Ln+1                                       greater: T > g(di,Pn,j)
  b) was assigned to Ln and now
     1) to Ln+1                                       equal: T in both cases
     2) to Sn+1,j                                     greater or equal: g(di,Pn,j) ≥ T

In every case the contribution of di to Fn+1 is at least its contribution to Fn', so that Fn+1 ≥ Fn' ≥ Fn. Since Fn' > Fn whenever any profile is changed at iteration n, and since Fn can assume only a finite number of distinct values, the profiles can change on only a finite number of iterations; the algorithm therefore terminates.

IMPLEMENTATION

The algorithm actually programmed differs from the one defined in the theorem in four respects. Let S*n,j denote the set of documents that score highest against Pn-1,j, so that the S*n,j form a partition without overlap, as in Doyle's method. Then:

1) each new profile Pn,j is computed from S*n,j rather than from the overlapping set Sn,j;

2) the old profile is always replaced by the new one, as in Doyle's method, and the iteration is terminated when the sets S*n,j remain unchanged from one iteration to the next;

3) a document di is assigned to Sn,j whenever g(di,Pn-1,j) ≥ Tn-1,i, where
     Tn-1,i = Hn-1,i - a·(Hn-1,i - T)   if Hn-1,i > T,
     Tn-1,i = T                          otherwise,
   Hn-1,i being the highest score of di against any profile, and a a parameter with 0 ≤ a ≤ 1;

4) if any Sn,j contains fewer than two documents, then Sn,j is eliminated, thereby reducing the number of clusters by one.

The advantages of this method over the one defined in the theorem are discussed in the present section; the disadvantage is, of course, that termination is not guaranteed. To show this, note that conditions 1) and 2) above are equivalent to the termination condition in Doyle's algorithm, since in Doyle's method Pn,j always corresponds to the new partition Sn,j, and Sn,j = S*n,j (no overlap is allowed). Also, if a = 0 in condition 3, then Tn-1,i = Hn-1,i; thus, only those documents di that score highest against Pn-1,j, and whose highest score Hn-1,i is at least T, are assigned to Sn,j. Therefore, with a = 0 this method is equivalent to Doyle's algorithm.

The first two modifications are implemented to improve the efficiency of the program. Although convergence is no longer guaranteed, all the experiments tried so far have in fact always terminated. Programs without these two modifications take about twice as long to run. Also, in cases where the overlap is not too high (S*n,j nearly equal to Sn,j), the new termination condition is usually equivalent to the one used in the theorem. That is, when S*n,j = S*n+1,j, then very often Sn,j = Sn+1,j.

The third modification does not improve efficiency, but it allows a more flexible and, intuitively, a more desirable method for creating overlap. The algorithm described in the theorem assigns a document di to a cluster Sn,j if g(di,Pn-1,j) ≥ T. This has two major disadvantages: 1) the overlap cannot be increased independently of the number of loose documents; increasing the overlap by lowering T in general decreases the percentage of loose documents; 2) the difference between di's highest score and di's second highest score is ignored; e.g., if T = 50, g(di,P1) = 200, and g(di,P2) = 50, then di is assigned to both S1 and S2. The first problem decreases the flexibility of the algorithm, since the amount of overlap and the percentage of loose documents cannot be varied independently. The example in the second part illustrates the other problem. It seems desirable that a document should be assignable to two or more clusters when it scores equally (or almost equally) as high against all of them. The previous method does not take this fact into account. In the new algorithm, documents are assigned to more than one cluster on the basis of how close the score is to the highest score, relative to the cut-off value T. The parameter a determines how close to the highest score the other scores must be. When a = 0, no overlap occurs, while a = 1 generates the maximum amount of overlap. With a = 1, the formula reduces to Tn-1,i = T; hence, it is the same definition of Sn,j as in the theorem.
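The assignment rule of condition 3 might be written as follows for a single document (Python; the scores against the m profiles are assumed to have been computed already, and the function name is an illustrative assumption):

```python
def assign_with_overlap(scores, T, a):
    """Assign one document, given its scores against the m profiles.

    scores : list of g(d, P_j) for j = 1..m
    T      : cut-off value
    a      : overlap parameter, 0 <= a <= 1

    The document is assigned to every cluster j with scores[j] >= T_i, where
        T_i = H - a*(H - T)   if H > T   (H = the document's highest score)
        T_i = T               otherwise.
    With a = 0 only the top-scoring cluster(s) qualify (no overlap); with a = 1
    the rule reduces to the plain cut-off T.  An empty result means the
    document is loose."""
    H = max(scores)
    if H < T:
        return []                      # below the cut-off everywhere: loose
    threshold = H - a * (H - T) if H > T else T
    return [j for j, s in enumerate(scores) if s >= threshold]

# The example from the text: T = 50, g(d,P1) = 200, g(d,P2) = 50.
print(assign_with_overlap([200, 50], T=50, a=1.0))   # [0, 1]: both clusters
print(assign_with_overlap([200, 50], T=50, a=0.0))   # [0]:    no overlap
```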
The last modification increases the efficiency of the program, and also avoids forming clusters around documents which should be classified as loose. When Sn,j contains only one document, and that document is contained in no other cluster, then it has the same status as a loose document.

EXPERIMENTAL RESULTS

The algorithm described in the preceding section is used to cluster the 82-document ADI collection and the 200-document Cranfield word stem collection (4). The results of the classification indicate three important problems: 1) the scoring function g tends to give higher scores to documents containing a larger number of concepts; thus, many of the documents containing very few concepts are classified as loose; 2) the documents do not move freely enough from one profile to another, i.e., the final clusters are quite similar to the initial ones; 3) the initial clusters cannot be chosen arbitrarily.

The Scoring Function

The first problem is due to the fact that g scores a document di against a profile Pj by simply adding up the rank values of all the concepts in di which appear in Pj. If di contains a larger number of concepts than dk, the chances are greater for di to receive a higher score. Figure 1 is a plot of the score of each document against its final profile versus the number of concepts in the document, for one of the ADI runs. Although there are a few exceptions, the graph indicates that the documents with a larger number of concepts generally receive higher scores. In fact, the average number of concepts in a loose document is eleven, while the average number of concepts per document for the entire collection is twenty.

Fig. 1. Initial Scoring Function (score versus number of concepts per document).

The solution to this problem is to weight the score inversely by the number of concepts in the document. The obvious answer is to divide the score by the number of concepts, but this overcompensates and gives many of the smaller documents the highest scores. Dividing by the square root of the number of concepts in the document does not solve the original problem; i.e., larger documents still receive higher scores. Satisfactory results are obtained when the score is divided by (number of concepts per document)^(7/8). Figure 2 represents the same ADI sample as Figure 1, except that the new scoring function h = g/(number of concepts per document)^(7/8) is used. Unlike the function g, h seems to be independent of the number of concepts in the document.

Fig. 2. Modified Scoring Function (score versus number of concepts per document).
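A sketch of the modified scoring function (Python; the profile representation follows the earlier sketches, and the 7/8 exponent is the value reported above):

```python
def h(doc, profile):
    """Modified scoring function: g divided by (number of concepts in the
    document) ** (7/8), so that documents containing many concepts no longer
    receive systematically higher scores."""
    concepts = set(doc)
    if not concepts:
        return 0.0
    score_g = sum(profile.get(c, 0) for c in concepts)   # the original score g
    return score_g / len(concepts) ** (7 / 8)
```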
Movement of Documents

The second problem is clearly indicated by examination of the results of the classification. Table 5 shows the initial and final clusters for the ADI collection. The problem occurs because the documents tend to "stick" to the clusters that they are already in. This problem is solved by a method similar to that used by Doyle.

Table 5. Final Results of ADI Classification

  Cluster    Initial Documents    Final Documents
  1          1-12                 1-11, 13, 21, 30, 33, 34, 40, 43, 51, 68
  2          13-24                3, 10, 13-24, 26, 33, 34, 53, 69, 79
  3          25-36                9, 11, 13, 20, 22, 23, 25-28, 30-34, 36, 47, 51, 55, 65, 75
  4          37-48                4, 7-9, 14, 20, 30, 37-48, 51, 69
  5          49-60                1, 5, 7, 20, 30, 32, 45, 47, 51-53, 55-59, 79, 80
  6          61-71                2, 9, 27, 30, 47, 51, 61, 62, 64-71
  7          72-82                10, 40, 51, 72-75, 77-81
  Loose                           12, 29, 35, 49, 50, 54, 60, 63, 76, 82

During the first few iterations, documents should be allowed to move freely from cluster to cluster, until a nucleus is formed within each cluster. The nucleus consists of those documents that are most highly correlated to one another. Once the nucleus is formed, these documents will probably not move from their present clusters. Clusters can be forced to contain only very highly correlated documents by raising the cut-off value T, assuming that documents with the highest scores are most similar to the other documents in the cluster. This assumption is investigated later. However, raising the cut-off value results in a larger number of loose documents. This is resolved by repeating the classification for a lower value of T, but using the clusters from the first classification as the initial clusters.

This creates the problem of how to determine the initial value of T, and how much to decrement it when the classification is repeated using as initial clusters the results of the first classification. The initial value of T should be high enough so that only those documents which score very highly against profile Pj are assigned to Sj. One method of achieving this is to pick T so that the clusters after the first iteration average q documents, where q is small compared to the total number of documents. In the experiments run so far, q is arbitrarily set at 4.

After termination of the first classification, a nucleus is formed within each cluster. T is now chosen so that a certain percentage of the loose documents are assigned to clusters after the first iteration of the second classification. Assuming it is desirable to have approximately x percent of the documents loose after the final clusters are formed, two approaches are possible: 1) T is lowered far enough so that only x percent of the documents remain loose after the first iteration; thus, after termination of the second classification, the clusters represent the final results; 2) T is lowered just enough to allow a certain percentage of the loose documents to be assigned to clusters after the first iteration; thus, the classification is repeated until approximately x percent of the documents remain loose.

Experiments performed using both methods indicate that the second approach allows greater control of the loose documents, with only slightly greater execution times. After the first classification, a large proportion of the documents still remain loose. Therefore, if x is not too high, method 1) decreases T by a large amount. This injects many new documents into the clusters, and several iterations are necessary before termination occurs. Also, T is chosen so that the percentage of loose documents is x at the end of the first iteration, but it is impossible to know beforehand the percentage of loose documents after the final iteration. In general, the more iterations, the more the final percentage varies from the percentage after the first iteration. In method 2), T is lowered just enough to allow a fairly small percentage (20% in the present experiments) of the loose documents to be assigned to clusters. This normally results in only a few iterations before termination occurs; therefore, the final percentage of loose documents does not vary much from the percentage loose after the first iteration.
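The outer control loop just described, an initial high cut-off followed by repeated classifications at successively lower cut-offs, might be organized as in the following sketch (Python). The callables run_classification, choose_initial_T, and lower_T stand for the calibration steps described in the text (average first-iteration cluster size q = 4; lowering T so that roughly 20% of the loose documents can be assigned); they are assumptions introduced for illustration, not the author's program.

```python
def classify_with_decreasing_cutoff(run_classification, choose_initial_T, lower_T,
                                    n_documents, target_loose_fraction,
                                    initial_clusters, max_passes=20):
    """Repeated classification with a decreasing cut-off (method 2 above).

    run_classification(T, clusters) -> (clusters, loose)
        runs the iterative classification to termination at a fixed cut-off T,
        starting from the given initial clusters;
    choose_initial_T(clusters) -> T
        picks T so that first-iteration clusters average about q (= 4) documents;
    lower_T(T, loose) -> T
        lowers T just enough for roughly 20% of the loose documents to be
        assigned on the next pass.

    Each pass starts from the clusters produced by the previous pass, and the
    process stops when about target_loose_fraction of the documents remain loose."""
    T = choose_initial_T(initial_clusters)
    clusters, loose = run_classification(T, initial_clusters)
    for _ in range(max_passes):
        if len(loose) / n_documents <= target_loose_fraction:
            break
        T = lower_T(T, loose)
        clusters, loose = run_classification(T, clusters)
    return clusters, loose
```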
The ADI collection is reclassified using the procedures described above, where it is desired that about 25% of the documents remain loose. Once again seven initial clusters are used, and the initial value of T is calculated to be 28.2, so that the clusters after the first iteration average four documents. However, in this case cluster 3 is assigned ten documents, while clusters 1, 5, and 6 contain only one document. Thus, these three clusters are eliminated, and the documents within them become loose. After termination occurs, the final clusters are used as initial clusters for the next classification, where T is set to 19.1. The process is repeated again for T = 16.8, and after termination 17% of the documents remain loose. Table 6 shows the final results of this classification. Compared with Table 5, many more of the documents have moved from their initial clusters.

Table 6. Final Results of New ADI Classification

  Cluster    Initial Documents    Final Documents
  1          1-12                 (cluster eliminated)
  2          13-24                3, 5, 9, 10, 14-17, 20-28, 30, 34, 37, 43, 45, 48, 53, 57-59, 64, 68, 69, 72, 79, 80
  3          25-36                1, 2, 5, 6, 8, 11, 13, 20, 21, 24, 27, 28, 30, 36, 39, 41, 43, 47, 51, 53, 55, 56, 58, 61, 62, 65-68, 70, 71, 79, 80
  4          37-48                7, 31, 42, 44, 46
  5          49-60                (cluster eliminated)
  6          61-71                (cluster eliminated)
  7          72-82                4, 9, 19, 32, 40, 51, 73-75, 78, 81
  Loose                           12, 18, 29, 35, 38, 49, 50, 52, 54, 60, 63, 76, 77, 82

Initial Clusters

In the present study, the initial clusters are determined by assigning the first p (or possibly p+1) documents to cluster 1, the next p (or p+1) to cluster 2, ..., and the final p to cluster m, where p = (total number of documents)/m.
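A sketch of this initial assignment (Python; the function name is illustrative):

```python
def initial_clusters(doc_ids, m):
    """Split the document list into m consecutive blocks of p (or p + 1)
    documents each, as used for the initial clusters in the present study."""
    p, extra = divmod(len(doc_ids), m)
    clusters, start = [], 0
    for j in range(m):
        size = p + (1 if j < extra else 0)   # the first `extra` clusters get p + 1
        clusters.append(doc_ids[start:start + size])
        start += size
    return clusters

# e.g. the 82 ADI documents into 7 clusters of sizes 12, 12, 12, 12, 12, 11, 11
print([len(c) for c in initial_clusters(list(range(1, 83)), 7)])
```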
Since the nucleus of each cluster depends quite strongly on the initial clusters, it is not surprising that different initial clusters lead to different results. If the initial clusters are chosen at random, it is unlikely that the documents within each cluster are very similar. Thus, the nucleus of each cluster might not be very tight. This problem is solved by insuring that the initial clusters contain at least a few documents that are highly correlated. In the ADI and Cranfield collections, the order of the documents is such that many adjacent documents are quite similar; therefore, most of the initial clusters contain a few highly correlated documents. In collections where the order of the documents is random, a simple, fast clustering scheme can be used to determine the initial clusters. This type of algorithm need only perform document-document correlations within a fraction of the document space, and therefore should not take up much time.

Evaluation of Results

The assumption was made earlier that those documents of a cluster Sj that score highest against the corresponding profile Pj are most similar to the other documents in the cluster. The phrase "most similar" is used to mean "correlate most highly", where a standard correlation function is used. Table 7 compares the score of each document to the average correlation (unweighted cosine function) of each document with every other document in the cluster. The documents are arranged in ascending order by score, and hopefully the correlations will also appear in ascending order. As the table indicates, there is a strong tendency for the higher scores to correspond to the higher correlations. Table 8 illustrates the same results for three out of the seven final clusters from the Cranfield collection.

Table 7. Score vs. Average Correlation for ADI Classification

          Cluster 1                        Cluster 2
  Document  Score  Avg. Corr.      Document  Score  Avg. Corr.
  25        19.1   .08             8         18.7   .09
  5         19.6   .12             5         18.8   .12
  64        20.1   .10             20        18.8   .12
  23        20.2   .13             68        18.9   .10
  27        20.3   .10             2         19.2   .12
  34        20.6   .11             70        19.2   .13
  15        20.6   .11             39        19.2   .14
  37        20.7   .14             28        19.3   .11
  48        20.8   .12             58        19.4   .12
  58        20.9   .12             36        19.5   .14
  28        21.0   .12             61        19.6   .11
  53        21.0   .14             56        19.6   .14
  20        21.0   .14             66        19.7   .12
  68        21.1   .12             67        19.9   .12
  80        21.2   .14             80        20.0   .14
  57        21.2   .13             43        20.0   .14
  59        21.3   .15             33        20.0   .15
  14        21.4   .13             21        20.0   .15
  16        21.5   .13             11        20.0   .14
  79        21.5   .15             65        20.0   .15
  43        21.6   .16             27        20.0   .13
  24        21.6   .15             71        20.1   .14
  69        21.7   .14             41        20.2   .15
  26        21.7   .16             79        20.4   .16
  17        21.8   .15             24        20.4   .15
  72        21.8   .17             13        20.5   .15
  21        22.0   .17             51        20.6   .12
  3         22.0   .17             53        20.7   .16
  9         22.1   .17             6         21.0   .18
  30        22.2   .17             62        21.4   .18
  22        22.3   .17             55        21.5   .17
  45        22.4   .18             1         21.7   .20
  10        22.4   .17             30        21.8   .18

          Cluster 3                        Cluster 4
  Document  Score  Avg. Corr.      Document  Score  Avg. Corr.
  31        31.0   .21             32        24.5   .10
  46        31.2   .05             51        25.0   .15
  44        31.3   .14             74        26.0   .16
  7         31.8   .24             4         26.3   .18
  42        33.2   .28             75        26.4   .16
                                   9         26.5   .19
                                   19        26.9   .17
                                   73        27.0   .19
                                   78        27.1   .18
                                   40        27.4   .24
                                   81        27.6   .21

Table 8. Score vs. Average Correlation for Cranfield Classification

          Cluster 1                        Cluster 2
  Document  Score  Avg. Corr.      Document  Score  Avg. Corr.
  26        22.0   .12             38        19.9   .12
  6         22.3   .13             97        20.2   .13
  7         22.4   .13             15        20.3   .11
  117       23.0   .14             1         20.3   .13
  2         23.1   .14             34        20.4   .13
  121       23.2   .15             145       20.4   .14
  13        23.3   .15             171       20.8   .12
  19        23.4   .16             172       20.9   .13
  60        23.7   .15             30        20.9   .14
  23        23.8   .17             4         21.1   .15
  18        23.8   .17             140       21.1   .15
  44        24.1   .18             72        21.2   .15
  183       24.2   .17             138       21.3   .15
  116       24.3   .18             143       21.3   .15
  128       24.5   .18             141       21.3   .14
  61        24.6   .18             27        21.6   .13
  9         24.6   .18             36        21.7   .13
  197       24.7   .19             157       21.7   .17
  16        24.7   .20             59        21.8   .16
  198       24.7   .17             156       21.8   .16
  3         24.8   .18             200       21.9   .16
  25        24.9   .20             32        22.0   .18
  28        25.0   .21             137       22.4   .15
  115       25.1   .21             29        22.5   .19
  58        25.6   .21             148       22.5   .17
  181       25.6   .20             57        22.8   .18
  56        25.9   .21             128       22.8   .15
  160       26.6   .23             44        23.1   .19
                                   31        23.2   .19
                                   56        23.4   .18
                                   139       23.9   .19
                                   160       24.3   .18
                                   58        25.3   .21

          Cluster 3
  Document  Score  Avg. Corr.
  179       31.4   .19
  154       32.5   .24
  79        32.8   .27
  133       32.9   .29
  134       33.4   .28
  77        33.7   .27
  132       33.8   .32
  78        34.1   .30
  76        34.3   .34
  74        34.4   .34
  75        34.5   .36
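The "Avg. Corr." columns of Tables 7 and 8 are averages of the unweighted cosine correlation; a minimal sketch of that computation, assuming documents are represented as sets of concepts (Python):

```python
import math

def cosine(doc_a, doc_b):
    """Unweighted (binary) cosine correlation between two documents,
    each represented as the set of concepts it contains."""
    a, b = set(doc_a), set(doc_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def average_correlation(doc, others):
    """Average correlation of one document with every other document in its
    cluster (the 'Avg. Corr.' column of Tables 7 and 8)."""
    return sum(cosine(doc, o) for o in others) / len(others)
```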
So far nothing has been said about how to choose the base value that is used to compute rank values. This integer has an important effect on the type of clusters produced. Recall that the rank value of a concept equals the base value b minus its rank. Suppose a cluster Sj contains four documents d1-d4, and a total of twenty different concepts. The lowest possible rank value for any concept is b - 4, since 4 is the lowest possible rank. If b = 20, then the lowest rank value is 16, while if b = 5, the lowest rank value is 1. Consider a document di which is the same as d1 except for one concept, and assume this concept does not occur in Pj. With b = 20, g(di,Pj) is between 16 and 20 points less than g(d1,Pj); with b = 5, g(di,Pj) is only 1 to 5 points less. Since large clusters have profiles containing many concepts, the chances of a random document di having concepts in the profile of a large cluster are greater than the chances of di having concepts in the profile of a small cluster. Therefore, if b is high, di will score much lower against the profile of the small cluster, and large clusters will tend to capture all the remaining loose documents at the expense of the smaller clusters. Experimental results support this hypothesis; i.e., a large base value produces a few clusters with many documents, and many clusters with only a few documents.

If, on the other hand, b is set so that the lowest rank value in an average cluster is 1, then there is a tendency for small clusters to get larger and large clusters to get smaller. In smaller than average clusters, all the rank values are high, since there are only a few different ranks. In larger than average clusters, the rank value as defined might become zero or even less than zero. In these cases, it is redefined to be 1, but then it is possible for many concepts to have a rank value of 1. Thus, a document often scores higher against the profiles of smaller clusters.

The results of the Cranfield classification clearly indicate the ability of a document to score higher against profiles of smaller clusters. During the classification, nine clusters are generated, and cluster 9 starts to grow much larger than average (average = 22 documents). It keeps growing until it contains 27 documents, and then it starts to oscillate. The following numbers indicate the number of documents in cluster 9 on successive iterations: 27, 21, 34, 17, 56, 0. Thus, cluster 9 is eliminated. The same thing happens to cluster 8 on the next few iterations. Although this tends to keep the size of the clusters somewhat uniform, it is not desirable to throw away a cluster which might contain many highly correlated documents. One solution which might be implemented is to split up large clusters into several smaller ones, i.e., to classify the documents within a single cluster. If the number of documents in the cluster is not too large, it might be practicable to use an N² algorithm to do this.

CONCLUSION

The classification algorithm that has been described in the preceding two sections requires the following parameters as input: 1) maximum number of clusters desired; 2) approximate percentage of loose documents desired; 3) decision on whether or not loose documents should be "blended" into the nearest cluster at the end of the classification; 4) amount of overlap desired.

The first parameter specifies the number of initial clusters that are formed. If no clusters are eliminated during the evaluation, then the maximum number are actually generated. The experiments run so far indicate that the number of clusters produced is usually only about 60% of the maximum.

The next two parameters determine the "tightness" of the final clusters; the higher the percentage of loose documents, the tighter the clusters. If no loose documents are desired, the second parameter can be set to 0, but very low percentages increase the running time of the program. Almost identical results are obtained in less time by specifying about 15% loose, and then asking for all loose documents to be assigned to the cluster against which they score highest.
The last parameter determines the amount of overlap. This number corresponds to a in the formula

  Tn-1,i = Hn-1,i - a·(Hn-1,i - T)   if Hn-1,i > T,
  Tn-1,i = T                          otherwise,

which was mentioned under Implementation. When a = 0, no overlap is produced, and with a = 1 the maximum amount of overlap is produced. The actual percentage of overlap for a given value of a depends on the collection, but results indicate about 10% overlap for a = .4, and about 20% for a = .6.

Although the algorithm is not guaranteed to terminate, convergence has always been obtained in practice. In order to prevent the program from looping in cases of non-convergence, the algorithm can be modified to permit a maximum of n iterations, whether or not convergence is obtained. The results indicate that clusters change very little after about four or five iterations, so this modification would not make much difference in the final clusters.

The true evaluation of the final clusters can only be made by actually performing two-level searches on the clustered document space. However, the algorithm is sufficiently general to allow for the evaluation of many different types of clusters.

REFERENCES

1. Jones, S. K.; Jackson, D.: "Current Approaches to Classification and Clump-Finding at the Cambridge Language Research Unit," The Computer Journal, 10 (May 1967).
2. Doyle, L. B.: Breaking the Cost Barrier in Automatic Classification. SDC Paper SP-2516 (July 1966).
3. Needham, R. M.: The Termination of Certain Iterative Processes. The Rand Corporation Memorandum RM-5188-PR (November 1966).
4. Salton, G.; Lesk, M. E.: Computer Evaluation of Indexing and Text Processing. Information Storage and Retrieval, Report ISR-12 to the National Science Foundation, Section III (Ithaca, N.Y.: Cornell University, Department of Computer Science, August 1967).