College and Research Libraries, March 1966

FREDERICK G. KILGOUR

Symbol-Manipulative Programing for Bibliographic Data Processing on Small Computers

Drawing upon experience in the processing of bibliographic data on a small computer, the author makes suggestions for appropriate programs for production runs. Subjects covered by these suggestions include programing language, procedures of program preparation, coding and flagging, field length, and hints on effective program application.

Mr. Kilgour is Associate Librarian, Yale University.

RELATIVELY LITTLE has been written on symbol-manipulative programing, and papers that have appeared discuss "high-level" languages to be run on large machines.1 Furthermore, publications in the general field of language-data processing are concerned, for the most part, with machine translation or information retrieval from natural language, two procedures which require large computers. However, in 1963 Don S. Culbertson proposed that COBOL should be the standard computer language for library data processing,2 and in the same year I. H. Pizer, D. R. Franz, and Estelle Brodman raised objections to Culbertson's plea.3 These two discussions appear to be the total literature on programing for bibliographic data processing.

1 "ACM Conference on Symbol Manipulation, May 20-21, 1960," Communications of the ACM, III (April 1960), 183-234; "Design, Implementation and Application of IR-Oriented Languages," in "Papers ... at ... Princeton, N.J., October 20-21, 1961," Communications of the ACM, V (January 1962), 8-46; [Symposium on Symbolic Languages in Data Processing, Rome, 1962.] Symbolic Languages in Data Processing (New York: Gordon and Breach, 1962), p. 114-85; Philip M. Sherman, Programming and Coding Digital Computers (New York: John Wiley and Sons, [1963]), p. 294-326; Daniel G. Bobrow and Bertram Raphael, "A Comparison of List-Processing Computer Languages," Communications of the ACM, VII (April 1964), 231-40.
2 Don S. Culbertson, "Another Tower of Babel?" Library Journal, LXXXVIII (March 1963), 940-943.
3 Irwin H. Pizer, Donald R. Franz, Estelle Brodman, "Mechanization of Library Procedures in the Medium-Sized Medical Library: I. The Serial Record," Bulletin of the Medical Library Association, LI (July 1963), 331-38.

This paper is based on, and will report, experience in programing the processing of bibliographic information on a small computer; namely, an IBM 1401 having a 4 K core, two tape drives, and advanced programing features. Most of the programs are production programs and are run on a daily production schedule, but some are run monthly. Therefore, the paper presents suggestions for programs for production runs, not for one-shot programs.

Perhaps the cardinal principle of a bibliographic data processing system is that the machine must not be allowed to impose its characteristics on the data or the procedure. In the case of library procedures, long experience has accrued; indeed, libraries are thousands of years old, while books have been printed for hundreds of years. Lessons learned empirically, decades and perhaps centuries ago, should not be discarded because of machine characteristics or because of difficulties in program planning or coding. For instance, it is not necessary to sacrifice lower-case letters in printout from high-speed printers; furthermore, final printout can be done on typewriters controlled by computer-produced punched cards or punched tapes, thereby achieving upper and lower case print.

Perhaps the most important characteristic of the computer to be used is that it should have variable word length storage in its internal memory. In short, its core should be able to store variable length words characteristic of natural language.
In the years ahead, language-data processing will be done on such machines, and a start made now should be in the right direction. The speed of the computer matters little, and indeed a slow computer may be better than a fast machine and most certainly will do better if high speed is acquired at the penalty of fixed word length.

A programing language to be used for bibliographic data processing on a small computer must be machine oriented. In other words, the programing language of choice is a symbolic-assembly language as close to machine language as available. Macros can be wasteful of space, and if used, should be only in the form of closed subroutines or routines used but once in the main program. The so-called high-level languages are certainly easier to learn but are inefficient and must be run on large machines. Moreover, these languages restrict a computer in the operations it can perform so that they lend a provincial quality to their programs, whereas symbolic-assembly languages make possible the writing of sophisticated programs taking full advantage of the computer's capabilities. Indeed, many workers employing large computers for language-data processing have increasingly used machine-oriented languages.

Culbertson, in the paper referred to above, urged that librarians adopt COBOL for library computer programs. He based his suggestion on the attractive premise that use of COBOL would yield standardized programs that could be run on a large array of computer models produced by various manufacturers. However, he recognized that to achieve such standardization only one of the COBOL dialects could be used. To compile COBOL, a slightly larger computer configuration must be available since four tape drives are required with a 4 K core. Furthermore, COBOL is a problem-oriented language and such languages are not inherently suitable for symbol-manipulative programing.
Still, Culbertson was quite correct in pointing out that use of a symbolic-assembly language may mean that an extensive and expensive reprograming task could accompany a change in equipment.

The output of language-data processing is often characterized by having one section which is fixed in its two dimensional design and other sections which may be varied two dimensionally. For instance, printed text on a page may vary in number of lines and in line length from book to book, but it is essentially a block of words. On the other hand, the running head and page number vary extensively in position. An analogous fixed and varied format is the card found in the conventional library card catalog. Here, bibliographic data describing books occur in the same relative positions on cards in most libraries. However, call numbers for books and headings for subjects, titles, editors, etc., are placed in varying positions according to individual library practices.

Programs can be written for formats having varied and fixed characteristics by assigning the processing of the fixed characteristics to the main program. In addition, a generator program can be written which will operate on data from a control card whereby the generator will write brief programs to determine position of that part of the format which may vary from one product to another. Such generator programs give flexibility to bibliographic data processing. Since language-data processing requires relatively large work areas in the computer's internal memory, the generator should be written in the work area and erased after it has written its programs to be used by the main program. Program tables can also be devised to attain similar flexibility.
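
The division of labor between generator and main program described above can be illustrated in a present-day language. The following Python sketch is only an analogy under stated assumptions, not a reconstruction of the 1401 programs: the control-card layout (`NAME=row,col` entries separated by semicolons), the function names `generate_placers` and `format_card`, the card dimensions, and the sample call number are all hypothetical.

```python
# Hypothetical control-card format: one "NAME=row,col" entry per variable
# element, separated by semicolons. Nothing here reproduces the 1401 programs.

def generate_placers(control_card):
    """Generator: return one small placement routine per control-card entry."""
    placers = {}
    for entry in control_card.split(";"):
        name, position = entry.split("=")
        row, col = (int(n) for n in position.split(","))

        def place(grid, text, row=row, col=col):
            # Write text into the card image at the position fixed
            # when the routine was generated.
            line = grid[row].ljust(col + len(text))
            grid[row] = line[:col] + text + line[col + len(text):]

        placers[name.strip()] = place
    return placers


def format_card(placers, call_number, heading, body_lines,
                height=8, body_row=2, body_col=8):
    """Main program: lay out the fixed block, then call the generated routines."""
    grid = [""] * height
    # The bibliographic description always occupies the same relative position.
    for offset, line in enumerate(body_lines):
        grid[body_row + offset] = " " * body_col + line
    # The variable elements go wherever this library's control card says.
    placers["CALLNO"](grid, call_number)
    placers["HEADING"](grid, heading)
    return [line.rstrip() for line in grid]


card = format_card(
    generate_placers("CALLNO=2,0;HEADING=0,8"),
    call_number="Z699",  # made-up sample value
    heading="DATA PROCESSING",
    body_lines=["Kilgour, Frederick G.", "  Symbol-manipulative programing."],
)
print("\n".join(card))
```

A second library with different practices would supply only a different control card, for example `"CALLNO=0,0;HEADING=1,8"`, and the main routine would be left unchanged. A program table, as mentioned above, would replace the generated routines with a lookup of positions at run time.
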
It is not possible to employ a monitor program in the conventional sense of that phrase in a small computer, but a somewhat similar effect can be attained by employing a systems tape, if a program is too large to fit into core. In such a circumstance, the program should be divided into three or more sections, one section being common to the others. In assembling the program, the common section can be assembled either first or last. Should it be first, the others should be written on tape with one of the tape sections in core at the start of processing. In the case of three sections, the third section may be left in core and written on tape during processing and the second section called in to replace it. This procedure may, of course, be reversed, for when the third section is needed, the common section containing calling instructions can bring it back into core. Also, the systems tape may be used to store data for recall in subsequent processing.

If a large amount of processing is necessary, several programs may be needed. In such a circumstance, the programs can be planned for a sequential system wherein they are linked together. The program system that yielded most of the experience on which this paper draws consists of four major programs, two of which have generators, and two that employ a systems tape. The four major programs have been linked together so that when the first program has completed its processing of the data, it loads the second program automatically. The third and fourth are similarly loaded. Such a program system greatly simplifies operation, for the operator need only place the data cards after the first program and depress the load button.

Symbol-manipulative programs are characterized by their high percentage of logic instructions. However, symbol-manipulative programs for language-data processing have an additional typical feature in that the core position of data is locked into the logic.
It is this characteristic of logic interlocked with position that places exceptionally heavy requirements on indexing features of a computer and on the programer to contrive efficient, closed subroutines to maintain index registers at correct values.

Another characteristic of such symbol-manipulative programs is a high density of labels. If the assembly program has a limit on the number of labels it will process in one iteration, care should be taken to keep the number of labels below that limit by use of actual addresses or by other means. Reiterative assembly of one large program for processing bibliographic information on a 1401 requires 25 minutes and uses a box and a third of punched cards. Since de-bugging may sometimes constitute as much as 90 per cent of the total time of development of a computer program,4 it is important to minimize assembly time.

4 Sherman, op. cit., p. 398.

If bibliographic data is to be sorted and arranged alphabetically, such sorting will constitute a major programing problem. Moreover, to obtain differentiation in sorting, it will most likely be necessary to sort on more characters than originally estimated. The technique for doing such sorting is, of course, to establish a separate sort control on which the computer actually operates. Initial articles should be removed from this control, and when various languages are involved, separate article tables for each language must be established in the program. Also, various spellings of "Mac" and many other letter, or letter and diacritical mark, combinations should or should not be brought together in the sort control, depending upon the filing rules adopted. Finally, there will be some elements, such as numbers in a title, that cannot be filed as they stand. An example is 1066 and All That.
If this title is to be sorted as though it were Ten Sixty-six and All That, there is no program which can instruct the machine to do it. Here it is necessary for the person processing the data to write out a separate sort control in the form of TEN SIXTY-SIX AND ALL THAT, and the programer must provide for a signal which will enable the computer to recognize that it must use this human determined sort control.

Another characteristic of language-data processing on a small computer is that the preparation of the data involves linking it with the program to be used. In general, it is necessary to identify each internal category of data with an exclusive code, and to assign such codes so that the program can recognize when processing moves from one category to another. Also, codes must be designed to identify sections of data within each category. These three simple coding requirements yield enormous flexibility in processing.

Codes or flags should be positive and exclusive. Negative flags, such as blank spaces, should be avoided, for experience has shown that they can be troublesome. In establishing flags, one should use symbols that occupy but one column on a punched card. As a concomitant of this rule, upper-case characters should not be used as flags, for in many circumstances it will necessitate the use of another flag to indicate upper-case. Hence, the flag would actually occupy two columns. Experience has shown that such two-column flags are difficult to program. Also, symbols should not be used which are in the regular printout of the system; it is preferable to use non-printing characters despite the unquestionable fact that they complicate the de-bugging process since they do not appear in a post listing of a program.

A further caution about preparing data for processing is to avoid having fixed internal fields.
In most, if not all cases, it will be necessary to have a fixed overall record length, but it is unnecessarily restrictive of the data to have fixed fields within the record length, although such fixed fields facilitate programing.

Finally, if a major bibliographic data processing system is to be established, it is most desirable that there be one operational application made early in the development. Such an application will reveal various difficulties which it may be necessary to solve by altering the preparation of the data, by elaborating the program, or by using both these techniques. Should it be necessary to change the data preparation, much future grief will be avoided, and it is for this reason that an early application is advantageous.
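
As an illustration of the separate sort control discussed earlier in this paper, the following Python sketch builds a filing key apart from the title that is printed. It is a minimal example under stated assumptions and not the 1401 routine: the article tables, the single rule filing "Mc" with "Mac", the thirty-character key length, and the use of `|` as the signal for a human-written sort control are all hypothetical choices, and the sample titles are included only to exercise the code.

```python
# Illustrative only: the article tables, the "Mac" rule, and the "|" flag are
# assumptions chosen for this sketch, not the conventions of the 1401 system.

INITIAL_ARTICLES = {
    "eng": ("THE ", "AN ", "A "),
    "fre": ("LES ", "LE ", "LA ", "L'"),
    "ger": ("DER ", "DIE ", "DAS "),
}

SORT_FLAG = "|"  # hypothetical flag: text after it is a human-written sort control


def sort_control(title, language="eng", length=30):
    """Build the key the machine actually sorts on; the title is left untouched."""
    if SORT_FLAG in title:
        # The person preparing the data supplied the filing form explicitly.
        key = title.split(SORT_FLAG, 1)[1].upper()
    else:
        key = title.upper()
        # Drop an initial article, using the table for the title's language.
        for article in INITIAL_ARTICLES.get(language, ()):
            if key.startswith(article):
                key = key[len(article):]
                break
        # One possible filing rule: bring "Mc" names together with "Mac".
        if key.startswith("MC"):
            key = "MAC" + key[2:]
    # Fixed-length key, padded with blanks.
    return key[:length].ljust(length)


def display_title(title):
    """Strip the flag and the human-written control before printing."""
    return title.split(SORT_FLAG, 1)[0]


titles = [
    ("The Library Quarterly", "eng"),
    ("1066 and All That|Ten Sixty-six and All That", "eng"),
    ("McGuffey's Reader", "eng"),
    ("Le Monde", "fre"),
]
filed = sorted(titles, key=lambda t: sort_control(*t))
print([display_title(t) for t, _ in filed])
```

The flag here is a positive, one-column signal in the sense recommended above, although a printing character such as `|` is used only to keep the example readable; the paper's preference is for a non-printing character. Whether "Mc" and "Mac" are brought together remains a matter of the filing rules adopted.
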