THE RECON PILOT PROJECT: A PROGRESS REPORT, NOVEMBER 1969-APRIL 1970

Henriette D. AVRAM, Kay D. GUILES, Lenore S. MARUYAMA: MARC Development Office, Library of Congress, Washington, D. C.

A synthesis of the second progress report submitted by the Library of Congress to the Council on Library Resources under a grant for the RECON Pilot Project. An overview of the progress made from November 1969 to April 1970 in the following areas: production, Official Catalog comparison, format recognition, research titles, microfilming, investigation of input devices. In addition, the status of the tasks assigned to the RECON Working Task Force is briefly described.

INTRODUCTION

An article was published in the June 1970 issue of the Journal of Library Automation (1) describing the scope of the RECON Pilot Project (hereafter referred to as RECON) and summarizing the first progress report submitted by the Library of Congress (LC) to the Council on Library Resources (CLR). RECON is supported by the Council, the U.S. Office of Education, and the Library of Congress. In order that all aspects of the project might be brought together as a meaningful whole, the various segments, regardless of the source of support, were covered in the second progress report and have been included in this article. In some instances, it has been necessary to introduce a section by repeating some aspects already reported in the June 1970 article in order to add clarity to the content of that section.

PROGRESS, NOVEMBER 1969 TO APRIL 1970

RECON Production

The production operations of the RECON Pilot Project are being handled by the RECON Production Unit in the MARC Editorial Office of the LC Processing Department. Printed cards with 1968, 1969, and 7-series card numbers have been provided from the Card Division stock for RECON input, and approximately 99,550 cards in the 1969 and 7-series have been received.
Using prescribed selection criteria, the RECON editors have sorted these cards and obtained approximately 27,150 eligible for RECON input. Approximately 150,000 cards in the 1968 series have also been received. The RECON editors have sorted 60,000 of these cards and obtained approximately 24,000 records eligible for RECON input. A large number of cards in these three series are already out of print, and replacement cards are being sent by the Card Division as soon as reprints are made.

Each card eligible for RECON input from the above-mentioned selection process is also checked against a computer produced index of card numbers for records in machine readable form. Each number in the print index has a corresponding code to show on which machine readable data base the record resides. The source codes are as follows:

M1 - MARC I data base
M2 - MARC II, 1st practice tape
M3 - MARC II, 2nd practice tape
M4 - MARC II data base
M5 - MARC II residual data base

(The two practice tapes contain records converted before the implementation of the MARC Distribution Service to test the programs and input techniques.)

The print index used for the final selection of the 1969 and 7-series card numbers contained only the records from M2-M5 (the MARC I data base consists of the records converted during the MARC Pilot Project, which ended in June 1968). For the selection of the 1968 records, another print index was produced which contains numbers for records on all five data bases. If the RECON editors find a match on the print index, the appropriate source code is added to the printed card; these printed cards are then maintained in a separate file. (Later in the project, the records in the data bases identified as M1 to M3 will be updated to conform with the current MARC II format and added to the RECON data base.) The remaining cards for RECON are reproduced on input worksheets and edited. To date, approximately 9,750 records in the 1969 and 7-series have been edited for RECON.
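The final selection step described above amounts to looking up each card number in the print index and routing the card accordingly. The following is a minimal sketch in Python, assuming the index is held as a dictionary. The card numbers, the function names `build_print_index` and `route_card`, and the routing labels are illustrative assumptions and are not taken from the LC programs.

```python
# Source codes as listed in the report.
SOURCE_CODES = {
    "M1": "MARC I data base",
    "M2": "MARC II, 1st practice tape",
    "M3": "MARC II, 2nd practice tape",
    "M4": "MARC II data base",
    "M5": "MARC II residual data base",
}


def build_print_index(records, allowed_codes):
    """Map card number -> source code, keeping only the allowed data bases."""
    return {number: code for number, code in records if code in allowed_codes}


def route_card(card_number, print_index):
    """Return ('separate file', code) on a match, else ('worksheet', None)."""
    code = print_index.get(card_number)
    if code is not None:
        return ("separate file", code)
    return ("worksheet", None)


# Hypothetical card numbers, for illustration only.
machine_readable = [("68-10001", "M1"), ("68-10002", "M4"), ("70-20001", "M5")]

# The index for the 1969 and 7-series covered M2-M5 only;
# the index for the 1968 series covered all five data bases.
index_1969_7 = build_print_index(machine_readable, {"M2", "M3", "M4", "M5"})
index_1968 = build_print_index(machine_readable, set(SOURCE_CODES))

print(route_card("70-20001", index_1969_7))  # match: filed with its source code
print(route_card("68-10001", index_1968))    # match on the MARC I data base
print(route_card("68-99999", index_1968))    # no match: goes to a worksheet
```

A dictionary is used here only because it gives a direct lookup by card number; the report itself describes a printed index consulted by the editors.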
RECON records in the 1969 and 7-series are being input by a service bureau. The contractor uses IBM Selectric typewriters equipped with an OCR typing mechanism, and the hard-copy sheets are run through an optical scanner. The output from the scanner is a magnetic tape which is processed by the contractor's programs to produce a tape in the MARC Pre-Edit format. This tape is then sent to LC and processed by the MARC System programs to produce a full MARC record.

Journal of Library Automation Vol. 3/3 September, 1970

Since the input for the retrospective conversion effort will be printed cards (or copies of printed cards from the Card Division record set), it will be necessary to compare these with their counterparts in the LC Official Catalog. The printed card for each main entry in the Official Catalog shows any changes that were made but did not warrant reprinting the card. Items on a printed card that could be noted in this fashion include changed subject headings, added entries, and call numbers. Since these will be important access points in a machine readable catalog record, it was felt that such revisions should be reflected in the RECON records.

The RECON Report (2) contains a lengthy discussion of the various factors involved in the catalog comparison process, such as the percentage of change in relation to the age of the record, the difficulty in ascertaining any changes because of language, interpretation of cataloging rules, etc. To determine the most efficient and least costly method of catalog comparison, two RECON editors were assigned to conduct an experiment to test eight different methods as follows:

1) Print-out checked in alphabetic order: single group of 200 records.
2) Proofsheets (already proofed) checked in worksheet (card number) order: group of 200 records in batches of 20.
3) Proofsheets (not proofed) checked in worksheet (card number) order: group of 200 records in batches of 20.
4) Proofsheets (already proofed) checked by mental alphabetization: group of 200 records in batches of 20.
5) Proofsheets (not proofed) checked by mental alphabetization: group of 200 records in batches of 20.
6) Worksheets before editing (not input) checked by mental alphabetization: group of 200 records in batches of 20.
7) Worksheets before editing (not input) checked in alphabetical order: group of 200 records in batches of 20.
8) Worksheets before editing (not input) checked in worksheet (card number) order: group of 200 records in batches of 20.

Mental alphabetization means the searching of all the entries in a batch beginning with "A," then all the entries beginning with "B," etc., even though the batch is not in alphabetical order. Each editor used 200 records for each method, made the necessary corrections, and recorded the time required as well as the number of corrections made.

Figure 1 shows the average number of records checked in an hour using the eight different methods of catalog comparison. Tables 1 and 2 give the estimated cost per record for each of the methods. In determining [...]

[Figure 1. Average number of records checked per hour for each of the eight methods of catalog comparison.]

Table 4. Input Devices

Manufacturer | Type | Model | Keyboard Configuration | Display | Record Length (Characters) | Purchase Price | Monthly Rent | Remarks
Cybercom | K/C | MARK I | KP | None | 80 | $7970 | $145 | Converter-$180/month
Data Action | K/C | 150 | KP | Projection | 720 | $5900 | $155 | Converter-$575/month
IBM | K/C | 50 | KP | Backlight | [illegible] | $9605 | $175 | Converter-$340/month
IBM | K/C | MTST V | T | Printed | Infinite | | $100 | Converter-$340/month
IBM | | MTST IV | T | Printed | Infinite | | $277 |
Sycor | K/C | 301 | T | CRT | 216 | $7000 | $150 | Converter-$130/month
Tycore | K/C | 8500 | KP | Light-Emitting Diodes | 240 | $6000 | $120 | Converter-$220/month
Viatron | K/C | 21 | T/KP | CRT | Infinite | $1920 | $39 | Many options affecting price
Burroughs | K/M | N-7000 | KP | Projection | 160 | $8400 to $12,200 | $165 to $277 |
Honeywell | K/M | Keytape | T/KP | Backlight | 80-400 | $7500 to $33,000 | $148 to $735 | Pooler for 2 stations-$200/month extra
Keymatic | K/M | 1091 | T | Backlight | Infinite | $8750 | $166 | Price is for basic 88 keys. 256 unique keys available as well as optional printer.
MAI | K/M | 100-92 | KP | Projection | 100 or 200 | $6400 | $160 | Pooler for up to 8 stations-$40/month extra
Mohawk | K/M | 6400 | KP | Backlight | 80 | $8000 | $145 | Pooler for 3 stations-$175/month extra
Motorola | K/M | KB800 | KP | None | [illegible] | $8500 | None | Pooler for 7 stations-purchase price $9700
Potter | K/M | KDR | KP | BCD (Bit) | 160 | $8100 | $165 | Pooler for 3 stations-$45/month
Sangamo | K/M | DS9100 | KP | Backlight | 120 | $8200 | $177 | Pooler for 10 stations-$247/month extra
Vanguard | K/M | Datascribe | KP | None | 200 | $8500 | $175 |
[illegible] | K/T | [illegible] | [illegible] | CRT | [illegible] | $78,000 | $200 |
Computer Entry System | K/T | 6000 | KP | None | 496 | $16,200 to $42,000 | $360 to $925 | Two to 6 stations
Mohawk | K/T | 9000 | KP | Backlight | 80 | $53,000 to $145,000 | $1040 to $2840 | Four to 16 stations
Computer Machinery | K/T | Key Processing | KP | Backlight | 250 | $92,500 to $168,100 | $2055 to $4095 | Eight to 32 stations
General Computer Systems | K/T | 2100 | T | Printed | 200 | $81,240 to $273,120 | $2350 to $7885 | Seven to 39 stations
Inforex | K/T | Key Entry | KP | CRT | 128 | $30,300 to $35,100 | $760 to $960 | Four to 8 stations
Penta Associates | K/T | Key Logic | KP | Backlight | 200 | $110,000 to $345,200 | $3000 to $8600 | Eight to 64 stations
Systems Eng. | K/T | Keytran | KP | None | 300 | $100,000 to $220,000 | $2875 to $6350 | Nine to 48 stations
Logic Corp. | K/D | LC-720 | KP | CRT | 350 | $148,000 to $300,000 | $2450 to $5800 | Four to 16 stations

Legend:
K/T = Key to magnetic tape system
K/D = Key to disk system
K/C = Key to cassette
K/M = Key to computer compatible magnetic tape
KP = Key punch
T/KP = Typewriter or key punch
T = Typewriter
Backlight = A matrix consisting of all individual characters that can be keyed. Each character, as keyed, is displayed one at a time in its particular position in the matrix.
Projection and Light-emitting diodes = A one-character position dot matrix. Each character, as keyed, is displayed one at a time in the same position.
BCD (Bit) = Lights displaying the bit position (on, off) of individual characters. Each character, as keyed, is displayed one at a time.

(The prices quoted and the characteristics given of each device reflect the best information that could be obtained by the RECON staff.)

[...] could be assigned to single keys and translated to their proper value by software, thus reducing the amount of keystroking required.

The Keymatic appears worth further investigation; therefore, the Library may rent a device for several months for testing and evaluation. A typist will be trained in current MARC/RECON procedures and assigned to the Keymatic as soon as her training period has been completed.
The first month will be spent training on the Keymatic prior to the actual input of RECON records to obtain production and error rates and cost evaluation for comparison purposes.

Serious consideration was also given in the RECON Report to direct-read OCR equipment; however, at that time no equipment existed that offered the technical capability to perform the conversion of the LC record set. Since then, preliminary investigation of the Model 370 CompuScan Universal Optical Character Reader proved interesting enough to continue further exploration of the device.

The Model 370 CompuScan is a computer directed flying-spot scanner which matches the scanned portion of a character with a character described in the core memory of the computer. The manufacturer has examined a sample of LC printed cards selected at random over a period of twenty years and has concluded that although the hardware is sufficient to read the record set optically, significant software effort would be required.

The results of the sampling indicated that the record set is not constituted entirely of "mint" cards, i.e., cards printed from the metal of the original Linotype composition, but is composed of originals and reprints of the original. When the stock of the original printing is close to depletion, the card is reprinted by photographing the card, and duplicates are made by a photo-offset process. As this cycle is repeated, the card for any one title could be several generations removed from the original. In some instances, a microscopic examination of the cards seems to indicate that the matrices used in the Linotype composition were worn. Because of these factors, what might appear as the same character to the naked eye would represent different pattern configurations to the scanner's core memory.

The coarseness of the card surface may also cause variations in the same characters. LC cards have a high rag content in order to meet the archival standards required by libraries.
The roughness of the surface does not affect the readability for the human but may cause variations in a given character when read by an optical scanner.

Another significant problem with LC cards concerns characters which touch, i.e., connections between what are intended to be distinct characters but are read by the scanner as one. For example, if a lower case "n" were next to a lower case "t" and the cross bar on the "t" touched the "n," the scanner would consider the combination of the "n" and the "t" as one character.

Software must be written to handle the variant character and the touching character problems. In the case of the touching characters, the machine must recognize some allowable limit of reading a single character, and when this limit is exceeded, the pattern read must be divided and matched against single-character patterns held in core. Programs can be written so that if either of the above conditions occurs, the output on magnetic tape will be flagged for later spot checking, permitting the scanner to continue to operate at throughput speeds without human intervention. The resultant magnetic tape would serve as input to the Library's format recognition programs to reformat the scanner's output into the MARC II format.

It has been estimated that the throughput speed of CompuScan would be in the vicinity of 1800 cards per hour. The LC record set will be microfilmed according to the specifications required by the scanner. Since the scanner operates with negative film, a very dark background with a very clear, white image is necessary. A tentative cost estimate of the microfilming and reading has been computed at approximately fifty cents per 1000 characters output on magnetic tape (approximately three LC cards). This price does not include the cost of the software.
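As a rough check on these estimates, the per-card figures follow from simple division. The sketch below assumes exactly three cards per 1,000 characters, which the text gives only as an approximation, and uses a hypothetical batch of 10,000 cards; neither the function names nor the batch size come from the report.

```python
COST_PER_1000_CHARS = 0.50  # dollars for microfilming and reading (from the text)
CARDS_PER_1000_CHARS = 3    # approximation quoted in the text
CARDS_PER_HOUR = 1800       # estimated CompuScan throughput (from the text)


def cost_per_card():
    """Approximate microfilming and reading cost for one LC card, in dollars."""
    return COST_PER_1000_CHARS / CARDS_PER_1000_CHARS


def scan_hours(cards):
    """Scanner hours needed for a given number of cards at the estimated speed."""
    return cards / CARDS_PER_HOUR


batch = 10_000  # hypothetical batch size, for illustration only
print(f"cost per card: ${cost_per_card():.3f}")
print(f"batch cost:    ${batch * cost_per_card():,.2f}")
print(f"scanner hours: {scan_hours(batch):.1f}")
```

Under these assumptions the cost comes to roughly seventeen cents per card, exclusive of software.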
Original printed "mint" cards will be used to test the device without implementing the required software, and depending on the results, investigation may be continued.

The keying of the 1969 RECON records has been performed by a contractor using an IBM Selectric typewriter with the resulting hard copy fed through a Farrington optical character reader. As part of the contractor's services to the Library, production rates were monitored and reported. This gave LC the basis to compare two devices, the key-to-cassette used at the Library of Congress for the MARC Distribution Service and the equipment used by the contractor for RECON records.

To make the comparison in Table 5, it was necessary to determine the costs for each method using the techniques developed in the RECON Report (9). Some modifications of cost were made to the original RECON estimates because actual figures are now available.

MARC costs were obtained by dividing the costs of the man-hours for typing and proofing in a given period by the number of records added to the MARC master file in the same period. The equipment cost per record was also based on the number of records added to the master file. Production rates associated with particular tasks were not used.

The manpower figures supplied by the contractor were limited to hourly production rates; therefore, to obtain the cost per record for OCR typing it was necessary to project the hourly rate to cover a man-year. The estimated annual production of a typist was then divided into the annual salary of a GS-4 (step 1) typist incremented by 8.5% for fringe benefits. The OCR equipment costs were computed on the basis of figures supplied by the contractor, assuming ownership of the OCR-font typewriter and service bureau rental of the scanner.

Table 5. Input Costs per Record

1. Manpower

Key to Cassette Method
  Typing                                                         $ .45
  Proofing                                                         .70
  Total                                                          $1.15

OCR Method
  Typing rate of contractor: 1,000 records in 104 hours, or 9.6 records per hour
  Typing cost at LC: [$5,522 + 8.5% ($5,522)] / (9.6 X 1,338) =   $ .466
  Proofing rate of RECON editors at LC: 1,534 records proofed in 173 hours, or 8.9 records per hour; 8.9 - 20% = 7.1 records per hour
  Proofing cost at LC: [$6,882 + 8.5% ($6,882)] / (7.1 X 1,338) = $ .786
  Typing                                                         $ .466
  Proofing                                                         .786
  Total                                                          $1.25

2. Equipment (costs do not include maintenance where applicable)

Key to Cassette
  Key to Cassette, monthly rental                                $100.00
  Converter, monthly rental prorated over 10 Key to Cassettes      26.00
  Total                                                          $126.00
  Hourly cost (assumes 132 hours a month)                        $ .955
  Effective production rate of Key to Cassette:
    Average weekly MARC output / 4 Key to Cassette units = 1,005 / 120 = 8.4 records/hour
  Record cost of Key to Cassette and converter: $.955 / 8.4 =    $ .114

OCR Method
  OCR-font typewriter
    Purchase price                                               $500.00
    40-month amortization                                          12.50/month
    Hourly cost (assumes 132 hours use)                              .095
    Effective production rate of OCR typewriter:
      (9.6 records/hour X 1,338 hours) / (132 hours X 12 months) = 8.1 records/hour
    Record cost of OCR typewriter: $.095 / 8.1 =                 $ .012
  OCR scanner
    Service bureau hourly rental                                 $ 50.00
    10,000 lines/hour; each record 18 lines; 555 records/hour
    Record cost of OCR scanner                                   $ .09
  Total record cost for equipment: $.012 + $.09 =                $ .102

The cost of proofing in the OCR method was based on the RECON experience at LC modified by contractor experience. In actual practice, OCR records are proofed and corrected by the contractor before they are proofed by RECON editors. It was assumed that double proofing is unnecessary but that allowance should be made for the added difficulty of reading copy with a higher proportion of errors. (A preliminary study of errors on RECON proofsheets has shown that there are fewer typographical errors on RECON proofsheets than on current MARC proofsheets.)
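The arithmetic in Table 5 can be retraced with a short script. This is a sketch under stated assumptions: the variable and function names are illustrative, and the 1,338 figure is read here as the number of hours in a man-year, which Table 5 implies but does not label. All dollar amounts and rates are those printed in the table.

```python
HOURS_PER_MAN_YEAR = 1338  # hours per man-year as implied by Table 5 (assumed reading)
FRINGE = 0.085             # fringe benefits added to salary
HOURS_PER_MONTH = 132      # assumed equipment use per month (from Table 5)


def labor_cost(annual_salary, records_per_hour):
    """Salary plus fringe, divided by a year's output at the given hourly rate."""
    return annual_salary * (1 + FRINGE) / (records_per_hour * HOURS_PER_MAN_YEAR)


# Manpower, OCR method
ocr_typing = labor_cost(5522, 9.6)
ocr_proofing = labor_cost(6882, 7.1)               # 8.9 records/hour less 20%
ocr_proofing_undiscounted = labor_cost(6882, 8.9)  # rate without the 20% allowance

# Equipment, key to cassette: unit rental plus prorated converter rental
cassette_hourly = (100.00 + 26.00) / HOURS_PER_MONTH
cassette_per_record = cassette_hourly / 8.4

# Equipment, OCR: typewriter amortized over 40 months, scanner rented by the hour
typewriter_hourly = (500.00 / 40) / HOURS_PER_MONTH
typewriter_rate = 9.6 * HOURS_PER_MAN_YEAR / (HOURS_PER_MONTH * 12)
typewriter_per_record = typewriter_hourly / typewriter_rate
scanner_per_record = 50.00 / (10_000 / 18)  # 10,000 lines/hour, 18 lines/record
ocr_equipment = typewriter_per_record + scanner_per_record

for label, value in [
    ("OCR typing", ocr_typing),
    ("OCR proofing", ocr_proofing),
    ("OCR proofing, no 20% allowance", ocr_proofing_undiscounted),
    ("Key to cassette equipment", cassette_per_record),
    ("OCR equipment", ocr_equipment),
]:
    print(f"{label:<32} ${value:.3f}")
```

Keeping the proofing rate as a parameter makes it easy to see how sensitive the OCR total is to the 20% allowance: at 8.9 records per hour the proofing cost falls from about $.79 to about $.63.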
To allow for this added difficulty of proofing, the number of RECON records proofed in an hour has been decreased by 20% in the calculations.

On the basis of the calculations in Table 5, the comparative input costs are summarized as follows:

Table 6. Estimated Input Cost per Record

                    Key-to-Cassette      OCR
Manpower:
  Typing                $ .45           $ .47
  Proofing                .70             .78
Equipment                 .11             .10
Totals                  $1.26           $1.35

The final figures indicate that the two methods are very close in cost. As presently calculated, the key-to-cassette method is less expensive than the OCR method, but a slight change in any cost or production rate could make the OCR method less expensive. If the proofing rate of 8.9 records per hour were maintained instead of decreasing to 7.1 per hour, the OCR proofing cost would drop to $.63, and the total price for this proposed method would be $1.20. One way to test the assumption of the added difficulty of a single proofing would be to obtain uncorrected records from the contractor as a means of determining the actual proofing rate under that condition.

RECON Tasks

The four tasks that have been identified for study by the Working Task Force are: 1) levels of completeness of MARC records; 2) implications of a national union catalog in machine readable form; 3) conversion of existing data bases in machine readable form for use in a national bibliographic service; and 4) study of problems involved in any future distribution of name and subject cross reference control files. Progress to date on the first three tasks is described in the following paragraphs.

Task 1 has been completed, and an article summarizing the results of a report submitted to CLR has been published in the Journal of Library Automation, June 1970 (10). The following conclusions reached by this study are quoted from the article:

1) The level of a record must be adequate for the purposes it will serve.
2) In terms of national use, a machine readable record may function as a means of distributing cataloging information and as a means of reporting holdings to a national union catalog.
3) To satisfy the needs of diverse installations and applications, records for general distribution should be in the full MARC II format.
4) Records that satisfy the NUC function are not necessarily identical with those that satisfy the distribution function.
5) It is feasible to define the characteristics of a machine readable NUC report at a lower level than the full MARC II format.

Task 2 consists of an investigation of the implications of a national union catalog in machine readable form. A design of such a system is needed, and although the implementation of such a project is beyond the purview of the Working Task Force, some of the technical and cost factors should be examined and defined for possible future research. As a framework for discussion purposes, a future reporting system for the National Union Catalog was postulated based on the present reporting system as follows:

Contributors | Present Report Form | Future Report Form
LC | Printed cards | LC-MARC data (for all records)
Outside libraries | Locally produced cards and LC cards | MARC data (for all records) or records submitted to NUC to be keyed as machine readable records

The problems of the control number and library location symbols were considered, but a tentative decision was made that recommendations should be forthcoming when the American National Standards Institute Sectional Committee Z39 has completed its work on library identification codes.

The indicators and subfield codes to be included in the machine readable NUC records would depend on the optimum file arrangement of the suggested bibliographic listings. The Library of Congress is presently engaged in a filing rules study which should influence the inclusion or exclusion of particular content designators.
Task 2 is still in progress.

Task 3 is the investigation of the possible utilization of other machine readable data bases for use in a national bibliographic store. The task was divided into several subtasks as follows: 1) identification of useful data bases for the purposes described (content and bibliographic completeness); 2) cost of the conversion from a local format to a MARC II record; 3) cost of updating records not already in the LC data base for consistency and missing data by comparing the records with the Library of Congress Official Catalog; 4) cost of comparing the records with the existing LC machine readable records to eliminate duplicate records.

To satisfy the first subtask, a questionnaire was sent to 42 organizations. The information requested included:

1) Availability of data bases: maintained by library or service bureau, and permission to copy data base.
2) Use of the data base: for acquisitions, production of book catalog, circulation system, etc.
3) Composition of data base: monographs, serials, technical reports, etc.
4) Composition of data base: number of titles, imprint dates (primarily current, retrospective, etc.), language of records.
5) Source of catalog data: MARC Distribution Service, LC catalog card, local cataloging.
6) Data elements for monographs.
7) Format used in identifying data elements: MARC I format, MARC II format, etc.
8) Character set used.

The results from this survey were analyzed, and a follow-up letter was sent to 22 of the organizations, requesting further information as follows:

1) An estimate of the number of monographs added to the data base each year.
2) Representative group of twenty-five entries for monographs including both fiction and non-fiction.
3) Details on the character set used in the machine readable data base.
4) Detailed specifications of monographic record format.

Responses from this last letter have been received and analyzed.
This analysis should identify a limited number of machine readable data bases that will be subjected to further content and cost analysis.

OUTLOOK

The RECON Project continues to be on schedule. The Working Task Force has met several times for deliberations on the assigned tasks; in addition, members have been briefed on the progress of the pilot project and their advice has been sought. Thus, individuals interested in the problems of bibliographic conversion guide the project throughout its development. The Library of Congress RECON staff continues to maintain liaison with individuals and organizations working in any facet of the project's scope, hoping to bring all expertise possible to bear on the problems involved.

It is significant, although not fully recognized at the onset of the RECON Project, that the solution to many of the problems under exploration will have impact on current conversion as well as retrospective conversion. This is evident at the Library of Congress where MARC and RECON, although staffed separately in the production area, share staff in the Information Systems Office, and the project is known as MARC/RECON.

Coordination continues between the RECON Project and the Card Division Mechanization Project. The RECON Project Director is the technical adviser for the Card Division Project, and under her general direction, a computer analyst in the Information Systems Office has been assigned full time to the project. The analyst has been given a detailed orientation to the procedures and computer programs for MARC/RECON and the specifications for the Card Division Project. This exposure is necessary to guarantee that there is no duplication of effort between the two projects and that the design work for the Card Division Project includes the possibility of a future national service for machine readable cataloging, both current and retrospective.
(The MARC Distribution Service is such a national service for English language monograph cataloging data, but what is assumed here is a service of a much broader scope.)

Although progress has been made in many of the tasks included in RECON, several methods of input described in the RECON Report can only be fully evaluated when the format recognition programs are implemented. According to present estimates, this should take place toward the end of 1970.

Much remains to be accomplished. The Library of Congress will continue to make its progress known as rapidly as possible, because the results of the pilot project will have great ramifications for the entire library community.

ACKNOWLEDGMENTS

The authors wish to thank the staff members associated with the RECON Pilot Project in the Technical Processes Research Office and the MARC Editorial Office in the Library of Congress Processing Department, and those in the Information Systems Office, for their respective reports, which were incorporated into the progress report submitted to the Council on Library Resources and which provided significant contributions to this paper.

REFERENCES

1. Avram, Henriette D.: "The RECON Pilot Project: A Progress Report," Journal of Library Automation, 3 (June 1970).
2. RECON Working Task Force: Conversion of Retrospective Records to Machine-Readable Form (Washington, D. C.: Library of Congress, 1969), pp. 32-33.
3. Avram, Henriette D., et al.: "MARC Program Research and Development: A Progress Report," Journal of Library Automation, 2 (December 1969), 250-253.
4. RECON Working Task Force: Op. cit., p. 31.
5. National Microfilm Association: Glossary of Terms for Microphotography and Reproductions made from Micro-Images. 4th rev. ed. (Annapolis, Md.: National Microfilm Association, 1966), p. 8.
6. Ibid.
7. Ibid., p. 52.
8. Hawken, William R.: Copying Methods Manual (Chicago: Library Technology Program, American Library Association, 1966), p. 243.
9. RECON Working Task Force: Op. cit., pp. 58-59, 86, 93.
10. RECON Working Task Force: "Levels of Machine-Readable Records," Journal of Library Automation, 3 (June 1970).