Duplicate detection algorithms of bibliographic descriptions

Anestis Sitas
School of Philosophy, Aristotle University of Thessaloniki, and School of Library Science, Technological Institute of Thessaloniki, Thessaloniki, Greece, and
Sarantos Kapidakis
Archive and Library Sciences Department, Ionian University, Paleo Anaktoro, Greece

Abstract
Purpose – The purpose of this paper is to focus on duplicate record detection algorithms used for detection in bibliographic databases.
Design/methodology/approach – Individual algorithms, their application process for duplicate detection and their results are described based on available literature (published articles), information found at various library web sites and follow-up e-mail communications.
Findings – Algorithms are categorized according to their application as a process of a single step or two consecutive steps. The results of deletion, merging, and temporary and virtual consolidation of duplicate records are studied.
Originality/value – The paper presents an overview of the duplication detection algorithms and an up-to-date state of their application in different library systems.
Keywords: Cataloguing, Algorithms, Bibliographic systems, Records management
Paper type: Research paper

Library Hi Tech, Vol. 26 No. 2, 2008, pp. 287-301. © Emerald Group Publishing Limited, 0737-8831. DOI 10.1108/07378830810880379. Received 11 October 2007; revised 22 October 2007; accepted 27 January 2008.

Introduction
The ideal setup for a library catalogue would be to register a unique bibliographic record for each bibliographic entity. However, bibliographic databases include several types of duplicate records. Even if the search cues are clearly specified, locating the correct entry is still an issue that requires further investigation as new materials are added in a variety of media. Duplicate records slow down the indexing process and significantly increase the cost of saving and managing data, not to mention that their retrieval is delayed. As a result, duplicate records constitute a system deficiency and compromise quality control for all parties involved, namely users, catalogers, and technical staff. Shared cataloging further aggravates the problem as, through the automated systems, each member library of one system can access the other members' records. Administrators have to improve the quality of the bibliographic database and keep it functional and "clean".

Duplicate records
In the environment of bibliographic databases, a duplicate record can be defined as two or more records which stand for or describe the same document (defined as any information resource). Duplicate records can cause problems in the following areas:
• User information overload. Because a larger number of documents is recalled, the user is presented with more information than he or she can actually handle.
• Reduced system efficiency. The actual number of records in the database increases, which complicates indexing, hinders searching and cataloging decision making, and affects end-user satisfaction.
• Low cataloging productivity. Identifying duplicate records and cleaning the database requires valuable cataloger time, which could be spent on other essential tasks.
• Increased cost of database maintenance. More time spent on database maintenance results in increased cost.
Possible reasons for the existence of duplicate records include novice searchers, unsuccessful searches, and the wish to enter a "perfect" record (Wanninger, 1982). Additional factors for record duplication include:
• local cataloging practices and policies;
• cataloging inconsistencies;
• careless record entry; and
• errors in the syntax of the MARC format.

Record matching algorithms
The existence of duplicate records constitutes a problem which is becoming increasingly alarming in networked environments, as the size of individual databases increases and new cooperative networks or consortia are created. In order to reduce the number of duplicate records, new software is developed using special detection algorithms. Record matching algorithms are programs used to maintain the integrity of bibliographic databases. It would be quite easy to create a process that matches two identical bibliographic descriptions, but it is not as easy to match similar records (Hunstad, 1988).

Developing a detection and deduplication process
Designing the process of detection and deduplication of records within a bibliographic database should take the following into consideration:
• Design goal. Specifying which types of documents will be represented in the records to be processed (articles, journals, etc.).
• Specification of duplicate records. Detailed definition of the term "duplicate record" based on the needs of the particular database.
• Application of the process. Specifying whether the process will be applied automatically, semi-automatically, or manually.

Creating a record-matching algorithm
In order to develop an effective algorithm, it is essential to define the application steps, the MARC fields to be used as matching keys, and the criteria for identifying and assessing record similarity/duplication.

Application steps
The algorithm can be applied as a one- or two-step comparison. A final step follows which deals with the management of the duplicate records. The single-step application of the algorithm is, in most cases, a compromise in order to achieve a fast and inexpensive deduplication. These algorithms tend to be more general, with loosely defined criteria, resulting in a large number of candidate duplicate records in need of further control. During the initial step of a two-step algorithm, a file of possible duplicate records is created based on a limited comparison of fields. Its principal aim is to minimize the number of comparisons during the second step and reduce mismatches that could lead to the deletion of unique records. The second step verifies matches from the first step and then applies a detailed and accurate comparison to determine actual duplicates.

Selection of fields
In order for such an algorithm to be created, it is important to select fields which exhibit significant stability regardless of who created the record (a specific cataloger or bibliographic agency). Fields with less stable data offer a low probability of record matching (Meir and Lazinger, 1998). Although deduplication based on a control number (ISBN, etc.) is the best method of detection, it does not always ensure full detection. Other data serving as sources for detection include author, title, publisher, pagination, place and year of publication (Coyle, 1992).

Matching keys
The algorithms for detection of duplicate records use matching keys, which are strings constructed from a pre-selected field or combination of fields. A field can be used as a key in part (e.g. ISBN) or in whole (e.g. title proper). Moreover, a combination of fields or a combination of field parts can also be used. Before these keys are created, the data are processed to normalize spacing, punctuation, special fonts or characters, and capitalization. In addition, a variety of techniques are used to accommodate field content differences such as spelling errors, missing data, and small variations of words. These techniques include truncation, keywording, Harrison Keys, Hamming distance, USBC, and others (Toney, 1992).
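To make the idea concrete, here is a minimal sketch in Python of how normalization and key construction might look. It is not taken from any of the systems discussed in this paper; the chosen fields (author, title, year) and the truncation lengths are illustrative assumptions only.

```python
import re
import unicodedata

def normalize(text):
    """Normalize spacing, punctuation, special characters, and capitalization."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))  # strip diacritics
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)      # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

def matching_key(record):
    """Build a composite key from parts of pre-selected fields.

    The field choices and truncation lengths are arbitrary; real systems pick
    them according to their own definition of a duplicate record.
    """
    author = normalize(record.get("author", ""))[:4]
    title = normalize(record.get("title", ""))[:10]
    year = record.get("year", "")
    return f"{author}|{title}|{year}"

rec1 = {"author": "Smith, John", "title": "The History of Libraries", "year": "2001"}
rec2 = {"author": "SMITH, J.",   "title": "History of libraries, The", "year": "2001"}
print(matching_key(rec1))  # smit|the histor|2001
print(matching_key(rec2))  # smit|history of|2001
```

Note that the two keys still differ because of the leading article, which is exactly the kind of variation the truncation and keywording techniques mentioned above are meant to absorb.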
Matching evaluation
Two methods are used to evaluate the matching of duplicate records:
(1) Field comparison. This is based on binary comparisons of selected fields, that is, whether fields appear to be the same or not. The software uses YES/NO indications. When the entire field is used, the comparison is safer but the process is time-consuming. This method is very strict and complicates the detection of records that have variations in cataloging or data entry errors (O'Neill and Oskins, 1990).
(2) Weight assigning. This method matches strings and estimates their similarity by assigning weights/values which reflect not the bibliographic significance of the data but their usefulness in recognizing similar records (Coyle, 1992). The matching algorithm allows the merging or deletion of entries only if the assigned weight reaches a pre-determined value, a threshold. This method tolerates minor differences in field content, spelling errors, incomplete or missing data, and variations in cataloging practice (Coyle and Gallaher-Brown, 1985).

Duplicate records handling
Another element in the design of a duplicate detection algorithm is the decision of how to handle duplicate records once they are detected. Toney (1992) presented three main practices:
(1) one record is selected as the master record and all others are deleted;
(2) one record is selected as the master record and all non-matching fields from the other records are added to the master (merging); and
(3) all records are kept but clustered around a master record.
Several variations can be added to the above practices. These include: retaining the record that was entered first in the database and deleting the more recent ones; retaining the most recent record and deleting all previous ones; and retaining either the first or the most recent record and merging into it the unique information from all others. Finally, one may choose to merge duplicate records only during the process of searching or retrieval (on the fly). Merging can be done instantly and "virtually", just for the purpose of displaying a single record to the end user.

Results of duplicate detection algorithms
In every effort at duplicate record detection the matching process may bring about the following results:
• Exact matches. Records which are absolutely identical.
• Partial matches. Only some parts of the records are duplicated.
• Mismatches (false matches). Although indicated as duplicates, the records do not represent the same document.
• Missed/undetected matches. Existing duplicate records that are not detected by the algorithm.
Mismatches are considered a more serious problem than missed matches, since deleting a falsely matched record causes a permanent loss of information. To avoid this problem, the algorithm should be loose enough to gather records with a degree of variation while avoiding the deletion of bibliographic information, and at the same time tight enough to restrict the accumulation of a large number of possible duplicate records without losing genuine duplicates (Meir and Lazinger, 1998).
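As an illustration of the weight-assigning approach and of this loose/tight trade-off, the sketch below sums pre-assigned field weights and classifies a pair of records against two thresholds; the field names, weights and threshold values are invented for the example and do not come from any of the systems described in this paper.

```python
# Illustrative field weights and thresholds; the values are invented for this sketch.
FIELD_WEIGHTS = {"isbn": 40, "title": 25, "author": 20, "year": 10, "pages": 5}
MATCH_THRESHOLD = 70   # at or above: treated as duplicates (tight, to limit mismatches)
REVIEW_THRESHOLD = 45  # in between: routed to a cataloger (loose, to limit missed matches)

def score(rec_a, rec_b):
    """Sum the weights of the fields whose normalized values agree."""
    total = 0
    for field, weight in FIELD_WEIGHTS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a and b and str(a).strip().lower() == str(b).strip().lower():
            total += weight
    return total

def classify(rec_a, rec_b):
    """Classify a candidate pair as duplicate, possible duplicate, or distinct."""
    s = score(rec_a, rec_b)
    if s >= MATCH_THRESHOLD:
        return "duplicate"
    if s >= REVIEW_THRESHOLD:
        return "possible duplicate (manual review)"
    return "distinct"

a = {"isbn": "0737883105", "title": "duplicate detection", "author": "sitas a", "year": "2008"}
b = {"title": "duplicate detection", "author": "sitas a", "year": "2008", "pages": "301"}
print(classify(a, b))  # possible duplicate (manual review): 25 + 20 + 10 = 55
```

Moving the two thresholds is exactly the loose/tight adjustment described above: lowering them catches more genuine duplicates at the risk of mismatches, while raising them does the opposite.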
Algorithm categorization

Types of material and status
This paper describes ten algorithms. Table I presents these algorithms based on the type of records they are designed to detect. In other words, it specifies whether they refer to the detection of duplicate bibliographic records of monographs, serials/journals, journal articles, or other types of material. In addition, the current status of each algorithm is noted. Their status may be defined as:
• Prototype systems. Applied in a lab environment.
• Inactive. While they were once applied in a real environment, their application is now abandoned.
• Active. Algorithms that are still applied.
The algorithms presented further on concern bibliographic records of monographs, except the one by Oak Ridge National Laboratory, which addressed journal articles. The algorithms for ALEPH-ULM, MDBUPD and IUCS are also applied to other types of documents (microforms, maps, etc.), while the one for MELVYL handles journal articles apart from monographs and journals. Finally, the Union Catalog of Greek Academic Libraries algorithm manages all sorts of materials except journals. As far as the state of their use is concerned, four out of ten (40 percent) are of the research type (Oak Ridge National Laboratory, MDBUPD, IUCS and Hickey and Rypka). Half of them (50 percent), including ILCSO, DDR, COPAC, MELVYL and the Union Catalog of Greek Academic Libraries, continue to be in use even today. One algorithm was applied to ALEPH's ULM catalog, but its application ended in 1998.

Table I. Document type and status
Algorithm                | Monographs | Journals | Other    | Status
ALEPH-ULM                | Yes        | Yes      | Yes      | Inactive
ILCSO                    | Yes        | Yes      |          | Active
Greek Union Catalog      | Yes        |          | Yes      | Active
OAK                      |            |          | Articles | Prototype
MDBUPD                   | Yes        |          | Yes      | Prototype
IUCS                     | Yes        |          | Yes      | Prototype
OCLC (Hickey and Rypka)  | Yes        |          |          | Prototype
DDR                      | Yes        | Yes      |          | Active
COPAC                    | Yes        | Yes      |          | Active
MELVYL                   | Yes        | Yes      | Articles | Active

Processes of application and evaluation
Apart from the type of materials they are applied to, these algorithms can also be distinguished according to the following characteristics:
• Application. This refers to the number of stages of application, either one- or two-step processing. Three algorithms (30 percent) are applied in one step (ALEPH-ULM, ILCSO, and Union Catalog of Greek Academic Libraries). The remaining seven algorithms (70 percent) follow a two-step process.
• Evaluation. This refers to the methods of comparison used to assess whether two or more bibliographic records are identical. These methods include either a comparison between fields or the assignment of weights. Of the algorithms presented in this paper, 40 percent use the method of field comparison (ALEPH-ULM, MDBUPD, IUCS, and Union Catalog of Greek Academic Libraries). The remaining 60 percent assign points/values as weights.
Table II presents each algorithm and its respective application method, whether the application is done during the process of searching or retrieval (on the fly), and its evaluation method.

Final handling and algorithm running
Furthermore, we can distinguish algorithms according to the final handling of the detected duplicate records (deletion or merging), as well as whether this process is done online or offline. Final handling information for each algorithm is presented in Table III.
Table II. Algorithm application and evaluation methods
Algorithm                | Steps | On the fly | Evaluation
ALEPH-ULM                | 1     |            | Field comparison
ILCSO                    | 1     |            | Weights
Greek Union Catalog      | 1     |            | Field comparison
OAK                      | 2     |            | Weights
MDBUPD                   | 2     |            | Field comparison
IUCS                     | 2     |            | Field comparison
OCLC (Hickey and Rypka)  | 2     |            | Weights
DDR                      | 2     |            | Weights
COPAC                    | 2     | Yes        | Weights
MELVYL                   | 2     | Yes        | Weights

Table III. Final handling and time of algorithm running (* = not available)
Algorithm                | Final handling | Offline | Online
ALEPH-ULM                | Merging        | Yes     |
ILCSO                    | Deletion       | Yes     |
Greek Union Catalog      | Merging        | Yes     |
OAK                      | *              | Yes     |
MDBUPD                   | Deletion       | Yes     |
IUCS                     | Deletion       | Yes     |
OCLC (Hickey and Rypka)  | *              | Yes     | Yes
DDR                      | Merging        | Yes     | Yes
COPAC                    | Merging        | Yes     | Yes
MELVYL                   | Merging        | Yes     |

Final handling
This refers to the final stage of the process of detecting duplicate records. Three programs (ILCSO, MDBUPD, IUCS), or 30 percent, delete duplicate records. The MDBUPD and IUCS algorithms end up deleting the spare records and retaining just one, while ILCSO selects and retains the most suitable one. In total, five of them (50 percent), including ALEPH-ULM, the Union Catalog of Greek Academic Libraries, DDR, COPAC, and MELVYL, merge duplicate records into one integrated record. COPAC merges the records in two of its three segments (the first segment includes only the British Library records, and each of the other two segments includes approximately 50 percent of the other catalog records). Among the three segments, however, there is no physical merging, but merged records can be presented to users in real time during a search. MELVYL's practice does not lead to the physical merging of duplicate records, but to the online presentation of merged records during the recall phase. For two out of ten algorithms (Oak Ridge National Laboratory, Hickey and Rypka), or 20 percent, there is no information available.

Application time
This refers either to the offline or the online process. All algorithms "run" offline. Only three of them (30 percent) have the ability to apply online procedures as well. The Hickey and Rypka algorithm was designed to run both ways; DDR was also designed to be applied both ways, but the offline procedure is preferred. Finally, in COPAC part of the procedure is applied offline and part of it is applied online. The term "online" is used here to refer to real-time running.

Fields used for the creation of keys (monographs)
Another significant characteristic of the algorithms is the set of MARC fields used for the creation of comparison keys. As we can see in Figure 1, the majority of algorithms (nine out of ten, 90 percent) use author, title and publication year for key creation. In addition, the algorithms also use the following fields in key creation: 70 percent of algorithms use pagination, 60 percent use ISBN, 50 percent use LCCN and/or publisher, 40 percent use edition statement, 30 percent use place of publication and/or series, 20 percent use fields like reproduction code, country of publication, government document number and ISSN, and finally 10 percent use fields such as document type, language of the document, CODEN, control number, cataloging source, statement of responsibility, volume/part and dimensions. Table IV presents detailed information on all fields that are used for duplicate bibliographic record detection.

Figure 1. MARC field use for key creation (monographs)

Table IV. Fields used in the creation of keys (monographs): a field-by-algorithm matrix covering document type, reproduction code, country of publication, language, LCCN, ISBN, ISSN, CODEN, control number, cataloging source, government document number, author, title, statement of responsibility, volume/part, edition, place of publication, publisher, publication date, pagination, dimensions, and series for the ten algorithms; the field-use percentages are summarized in the paragraph above.

Algorithm efficiency
Most organizations that apply duplicate detection procedures have not publicized their algorithm efficiency results. Even the data at hand are not absolutely comparable, since each case is distinct and because the results of application depend on:
• the type/types of documents;
• the given definition of "duplicate record";
• the consistency of cataloging and data entry; and
• the target set by each algorithm.
From the data presented in Table V we draw the following:
• the effectiveness of the algorithm applications ranges between 44.95 percent and 99.62 percent; that is, only this percentage of the total identified duplicate records were real duplicate records;
• mismatches range below 1.5 percent; and
• missed matches range around 4 percent, with the exception of those reported for ALEPH, which range from 17.4 to 34 percent.

Table V. Algorithm efficiency (* = not available)
Algorithm                | Effectiveness (%) | Mismatches (%) | Missed matches (%)
ALEPH-ULM                | *                 | 0-1.5          | 17.4-34
Greek Union Catalog      | 44.95             | *              | *
IUCS                     | 56.58-99.62       | 0.54           | *
OCLC (Hickey and Rypka)  | 54-69             | 1.3            | *

Following is an analysis of each individual algorithm.

One step algorithms

ALEPH-ULM
ALEPH is the network of the research libraries of Israel, which maintained the Union List of Monographs. The entries were loaded with the use of its detection and merging algorithm. It was based on the comparison of a fixed number of infrequently occurring letters taken from four fields: author (five characters), title proper (seven characters), publication date and language (Lazinger, 1994). A 1996 research study examined the efficiency of the algorithm when applied to monographs. It was reported that it yielded 0 percent mismatches for records describing Hebrew materials and 1.4 percent for English, but it failed to detect existing duplicates for 17.4 percent of English and 34 percent of Hebrew records (Meir and Lazinger, 1998). ULM, now named the Union List of Israel, decided that the algorithm did not satisfy its requirements and stopped all deduplication efforts in 1998.

Illinois Library Computer Systems Organization
For duplicate record detection, the system uses indices of the following control numbers: OCLC, LCCN, ISBN, ISSN, and publisher number. When the data in these indices overlap, they are given specific values. Further actions are then determined based on the sum of the weights. This is an offline process. The values recommended for the bulk import of records are shown in Table VI (ILCSO, 2004).

Table VI. Recommended values for bulk import of records (ILCSO, 2004)
Duplicate replace = 100
Duplicate warn = 30
Indexes and weights: 035O = 100; 010A = 20; 020A = 25; 022A = 15; 028A = 10

Once the comparison is done and the matching shows that two bibliographic records represent the same document, they are evaluated so that the most suitable one is selected to remain in the database while the other is deleted. For each field used in this evaluation there is a corresponding field weight to help decide which record will remain. The fields used include: cataloging source, encoding level, agency that has modified the original record, and bibliographic level of the record. In dubious cases the final decision is taken by comparing the records manually (ILCSO, 2004).
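A sketch of how these recommended weights and thresholds could be combined is given below. The replace/warn semantics is inferred from the parameter names in Table VI and is an assumption of this sketch, not a description of how the Voyager software actually implements the check.

```python
# Weights for the control-number indexes recommended by ILCSO (2004); see Table VI.
# 035 = OCLC number, 010 = LCCN, 020 = ISBN, 022 = ISSN, 028 = publisher number.
INDEX_WEIGHTS = {"035O": 100, "010A": 20, "020A": 25, "022A": 15, "028A": 10}
DUPLICATE_REPLACE = 100  # at or above: treat the incoming record as a duplicate
DUPLICATE_WARN = 30      # at or above (but below replace): flag for manual review

def overlap_score(incoming, existing):
    """Sum the weights of the control-number indexes whose values coincide."""
    return sum(weight for index, weight in INDEX_WEIGHTS.items()
               if incoming.get(index) and incoming.get(index) == existing.get(index))

def bulk_import_decision(incoming, existing):
    """Decide what to do with an incoming record during bulk import (sketch only)."""
    s = overlap_score(incoming, existing)
    if s >= DUPLICATE_REPLACE:
        return "duplicate: keep the most suitable record, delete the other"
    if s >= DUPLICATE_WARN:
        return "possible duplicate: warn and review manually"
    return "add as a new record"

incoming = {"020A": "9780231130806", "010A": "2005012345"}
existing = {"020A": "9780231130806", "010A": "2005012345"}
print(bulk_import_decision(incoming, existing))  # ISBN + LCCN = 45: flagged for review
```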
Union Catalog of Greek Academic Libraries
Use of this algorithm started in April 2005. At the time of import, records are checked for duplicate detection and merging. The imported records are created in a variety of software systems and therefore have differences in format, in the number of characters, in the holdings attached to existing records, etc. After loading, the records are processed to remove such variations. To accommodate this, the key is formed by taking data from the following fields: title, author, edition statement, publication date and ISBN (Vougiouklis, 2007). Questionable duplicate records are kept in a work file to be examined manually. Based on the evaluation of the algorithm, it was estimated that 44.95 percent of actual duplicate records were detected. Among the detected problems, 17.8 percent were mainly due to the applied key, 12.47 percent were due to policy issues, 7.05 percent represented cataloging problems, and 17.62 percent referred to other kinds of problems.

Two step algorithms

Oak Ridge National Laboratory
In 1976, Oak Ridge National Laboratory created an algorithm aimed at detecting duplicate records of cited journal articles. It was used offline and produced fixed-length keys (Hickey and Rypka, 1979). Publication date, initial page number, journal CODEN, volume number, and samplings from the author, journal title, and article title elements were used for record matching. For duplicate record detection the keys were sorted on various fields. When fields matched perfectly, a weighted matching of the remaining fields was used. The algorithm was completed with page/year and author/title sorts.

Online Computer Library Center (OCLC): MDBUPD
This program was created by OCLC shortly after 1976; it was named Master Data Base Update (MDBUPD) and was used offline. The algorithm was designed as a two-step application (Wanninger, 1982). Initially, it searched the database using the LCCN and keys produced by OCLC. These keys were derived from the name/title fields or just from the title field. It then checked additional fields for verification: publisher, place of publication, title, date of publication, and pagination. At the end of this process, after an exact match of all compared fields, duplicate records were deleted.

University of Illinois: IUCS
IUCS (IRRL [Information Retrieval Research Laboratory] Union Catalog System) was developed to detect duplicates among monographic documents as well as maps, filmstrips, etc. (Williams and MacLaury, 1979). The data were normalized and then processed by field comparison in two steps/passes. The first step involved the creation of a matching key. The "title-year" keys were sorted, and the keys of documents that were identical were later recalled and compared in the second step (Hickey and Rypka, 1979). In the second step, a number of detailed matching processes were applied so that the first estimate was either verified or rejected. A title mapping key different from that of the first step was used. The author names, titles and pagination of the records recalled in the previous step as possible duplicates were compared to determine which were ultimately duplicates. The efficiency of this algorithm ranged from 56.58 to 99.62 percent, depending on the database being tested. Mismatches accounted for 0.54 percent of the total number of duplicate records (Cousins, 1998). When it was not possible to reject or accept records as duplicates, a non-automated comparison of records was used (Hickey and Rypka, 1979).
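The two-pass structure used by IUCS and the other two-step algorithms can be pictured as follows. The key construction and the second-pass comparison here are simplified stand-ins for the actual IUCS procedures, which the sources describe only in outline.

```python
from collections import defaultdict

def first_pass_key(record):
    """Pass 1: a coarse title-year key used only to gather candidate duplicates."""
    title = "".join(ch for ch in record.get("title", "").lower() if ch.isalnum())
    return f"{title[:8]}|{record.get('year', '')}"   # eight characters is an arbitrary choice

def second_pass_match(a, b):
    """Pass 2: detailed comparison of author, title and pagination.

    Plain equality stands in for the more elaborate comparisons IUCS applied.
    """
    return all(a.get(f, "").lower() == b.get(f, "").lower()
               for f in ("author", "title", "pages"))

def deduplicate(records):
    # Pass 1: cluster records that share the coarse key.
    clusters = defaultdict(list)
    for record in records:
        clusters[first_pass_key(record)].append(record)
    # Pass 2: within each cluster, verify or reject each candidate pair.
    duplicates = []
    for group in clusters.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                if second_pass_match(group[i], group[j]):
                    duplicates.append((group[i], group[j]))
    return duplicates
```

The only purpose of the first pass is to keep the number of expensive second-pass comparisons small; pairs that share a coarse key but describe different documents are rejected in the second pass.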
Online Computer Library Center (OCLC) – Hickey and Rypka
During 1978-1979 OCLC tried once again to develop a research program for detecting duplicate monographs. This algorithm was developed by Hickey and Rypka and could be applied both online and offline. It was applied in two steps/sections (Hickey and Rypka, 1979):
(1) The first step, or exact-match section, aimed at clustering related keys in order to reduce the number of full key comparisons.
(2) In the second step, keys derived from selected fields were compared for partial or full matches.
These keys were derived from the following fields of the bibliographic record: reproduction code, record type, title (only the beginning), publication date, place of publication, author, pages, publisher and hashed title. SuDoc number, ISBN, edition statement, series, and LCCN were incorporated only if present in the bibliographic records. The key comparisons were checked against a decision table to determine whether the keys were duplicates. This table specified 16 alternative ways by which two keys could be matched. The comparison of two keys could yield a result taking any one of three values: mismatch, partial match (P), or exact match (E). It was found that mismatches were 1.3 percent of the total records identified as duplicates (Hickey and Rypka, 1979). The algorithm located approximately 54-69 percent of duplicate records, depending on whether reprints were defined as duplicates or not.

Online Computer Library Center (OCLC): DDR
In 1990, OCLC created a new algorithm for duplicate record detection. It is applied to monographs and journals and consists of two steps. In the first step, the clustering algorithm groups possible duplicate records using an eight-character title key built after the data have been normalized. Records sharing the same title key are then compared on seven more elements: LCCN, ISBN, publication date, pages, author, publisher, and full title. Records with the same title key and an identical LCCN or ISBN, or with at least two of the other five elements identical, are considered possible duplicates (O'Neill and Oskins, 1990). In the second step, the evaluation algorithm is applied. This estimates the similarity between possible duplicate records. The similarity values range from 0.0 for completely dissimilar records to 1.0 for absolutely identical records (O'Neill and Oskins, 1990). Elements are considered partial matches if their similarity is greater than 0.85. When no automated decision is possible, the records are identified for non-automated control. Research showed that the recall of clustering is 96 percent and that 56 percent of the total duplicate records can eventually be detected (O'Neill and Oskins, 1990). This algorithm led to the creation of the DDR software, which is used to identify and merge duplicate records representing books and periodicals. Although it can also run online, OCLC has chosen to apply it as an offline procedure.
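The evaluation step can be pictured with a per-element similarity in the 0.0-1.0 range and the 0.85 partial-match cut-off mentioned above. The use of Python's difflib ratio as the similarity measure and the equal weighting of the seven elements are assumptions of this sketch, not details of the DDR algorithm.

```python
from difflib import SequenceMatcher

ELEMENTS = ("lccn", "isbn", "date", "pages", "author", "publisher", "title")
PARTIAL_MATCH_CUTOFF = 0.85  # element similarity above this counts as a partial match

def element_similarity(a, b):
    """Similarity of two element values, from 0.0 (dissimilar) to 1.0 (identical)."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def partial_matches(rec_a, rec_b):
    """List the elements whose similarity exceeds the partial-match cut-off."""
    return [e for e in ELEMENTS
            if element_similarity(rec_a.get(e, ""), rec_b.get(e, "")) > PARTIAL_MATCH_CUTOFF]

def record_similarity(rec_a, rec_b):
    """Average the element similarities (equal weighting is an assumption here)."""
    scores = [element_similarity(rec_a.get(e, ""), rec_b.get(e, "")) for e in ELEMENTS]
    return sum(scores) / len(scores)

a = {"title": "duplicate records in the online union catalog", "author": "oneill e"}
b = {"title": "duplicate records in the on-line union catalog", "author": "o'neill e"}
print(partial_matches(a, b))   # ['author', 'title'], small spelling variations tolerated
print(round(record_similarity(a, b), 2))
```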
Consortium of University Research Libraries (CURL): COPAC
COPAC, the union catalog of the members of CURL, has been in use since 1996. The process of duplicate record detection follows two distinct practices. The first deals with the detection of duplicate records applied to only one part of the database (the second practice is described in the "Detection and merging on the fly" section). The process takes place offline with the aim of merging duplicate records. It is applied in two steps/stages.
Step one: each imported record is compared to every record in the database. To achieve this, two methods are used (Cousins, 1998):
(1) Matching on ISBN/ISSN: clusters of matching records are located based on ISBN. After the text is normalized, matching fields are assigned weights/values. In the end, the values of all fields are added up. If the total assigned weight is equal to or greater than 13, the record is identified for merging. If the record has an edition statement, matching of this field is also necessary. In the same way, checks for series volumes and multi-volume works take place (Cousins, 1998).
(2) Matching on an author/title acronym: records without an ISBN or ISSN, and records with an ISBN/ISSN that fail to find a similar record, are re-examined with the use of an author/title acronym (4/4 letters) together with the publication year; a rough sketch of such a key appears at the end of this section. Possible matches are promoted to the next step. At this point no weighting is applied and matching for each field is a simple YES/NO. Matching based on acronyms introduces the matching of two new fields: publisher and total number of pages (Cousins, 1998).
Step two: in order to verify possible matching records, a number of detailed matchings take place. The fields used in this process are: ISBN, ISSN, publication date, title, author, edition statement, series, pagination, and publisher.
COPAC still continues to apply the process described above, but part of its process is done during searching by end users. This part of the process is presented in the following section.
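The author/title acronym key might be constructed roughly as follows; the exact characters COPAC selects and the way the publication year is attached are not spelled out here, so this construction is an assumption for illustration.

```python
import re

def acronym_key(record):
    """Build an author/title acronym key: four letters from the author, four from
    the title, plus the publication year (an illustrative construction only)."""
    def letters(value, n):
        return re.sub(r"[^a-z]", "", value.lower())[:n].ljust(n, "_")
    return (letters(record.get("author", ""), 4)
            + letters(record.get("title", ""), 4)
            + str(record.get("year", "")))

rec = {"author": "Cousins, Shirley", "title": "Duplicate detection", "year": 1998}
print(acronym_key(rec))  # cousdupl1998
```

Records that share such a short key are only candidates; the second step still verifies them field by field.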
Detection and merging on the fly
All the algorithm applications described so far aim primarily at the deduplication or merging of duplicate records. Another practice is to apply the program on the fly. Detection and merging of duplicate records is then done during the search or retrieval of records and does not lead to their physical merging, but only to a temporary or "virtual" merging for the purpose of presentation to end users. Two programs that apply this method are described below.

COPAC: detection and merging of records upon search
The majority of the duplicate record detection and merging process continues to take place offline, as described previously. Data loading is applied as three separate sets, which leads to the creation of three segments in the database (Cousins, 2006):
• One set is the data from the British Library. These records are not consolidated.
• The other two data sets, each consisting of records from approximately half of the other COPAC libraries, have their records consolidated into a specific segment during data loading, using the process described earlier.
There is no record consolidation between the three segments, which leads to the existence of duplicate records across them. To compensate for this, a check for duplicate records is performed as an on-the-fly process. When a user searches, the results are checked for possible duplicate records before they are displayed. When duplicate records are found, they are displayed to the user as just one record, which includes all information from the other records in the result set. This matching and consolidation process at loading time, combined with matching during the search process, is a substantial compromise compared to the actual detection and merging of duplicate records over large amounts of data.

MELVYL: detection and merging of entries upon retrieval
The network of the University of California libraries supports the entire system, which runs duplicate record merging on the fly. The records are not merged physically; they are merged and presented dynamically during the search process. Apart from book and journal records, the monographs algorithm is applied to in-analytics as well as to other non-print materials. The algorithm is applied when each new record is loaded into the MELVYL database, which is basically an offline process. Every time a new record is loaded, its possible identical records are located and the result is saved in an Oracle table. If a record matches a user's search criteria, the system automatically checks this table and the best record is recalled (Campbell, 2006). A two-step process is followed for the advancement of identical records to the final phase of merging. Initially, a pool of possible duplicate records is created. In the first step, there is a comparison of LCCN/ISBN, publication year, and the first twenty-five characters of the title. At this point a threshold weight is assigned; the threshold for merging monograph records is 875 points. If identification for merging is not achieved during the first step of comparisons, a second step of comparisons is performed based on data from the title, main entry (normalized), country of publication, pagination, and publisher.
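The on-the-fly consolidation used by COPAC and MELVYL can be pictured as a post-processing step over a result set: duplicates are grouped and shown as one display record while the stored records remain untouched. The grouping key and the way the holdings are combined below are illustrative only and do not reproduce either system's actual logic.

```python
from collections import defaultdict

def display_key(record):
    """Group records that appear to describe the same document (illustrative key)."""
    return (record.get("title", "").lower(),
            record.get("author", "").lower(),
            record.get("year"))

def consolidate_for_display(result_set):
    """Virtually merge duplicates in a result set just before it is displayed.

    Nothing is written back to the database: each group of duplicates is shown
    as a single record carrying the union of the holdings information.
    """
    groups = defaultdict(list)
    for record in result_set:
        groups[display_key(record)].append(record)

    merged = []
    for records in groups.values():
        display = dict(records[0])  # use the first record of the group as the base
        display["holdings"] = sorted({h for r in records for h in r.get("holdings", [])})
        merged.append(display)
    return merged

results = [
    {"title": "Duplicate detection", "author": "Sitas, A.", "year": 2008, "holdings": ["Library A"]},
    {"title": "Duplicate Detection", "author": "Sitas, A.", "year": 2008, "holdings": ["Library B"]},
]
print(len(consolidate_for_display(results)))  # 1: the two records collapse into one display record
```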
Conclusion
This paper examined the algorithms applied to eliminate the problems caused by the existence of duplicate bibliographic records in a database. When algorithms are applied in one step, a faster application is achieved, but the percentage of database cleanup usually remains low. Most algorithms are two-step applications. These result in a greater improvement in database quality, since the initial application of a short key collects all possible duplicate records into a file, and the rest of the algorithm is applied only to this new file. The methods used for duplicate matching evaluation are field comparison and weight assignment. Almost all the algorithms studied run offline. Also presented is another approach, which facilitates a temporary consolidation as a user carries out a search or during the recall stage (the on-the-fly process). The result of this method is not the physical merging of duplicate records in the database but their temporary or "virtual" consolidation for the purpose of presentation to the user.
For the creation and selection of appropriate duplicate record handling algorithms there is neither an absolute and specific solution, nor a system or tool which can simply be transferred and applied unchanged from one environment to another. Each environment has its own specifications and policies; it applies specific practices and has specific and special needs. In every system the application of these algorithms calls for a special study and for modifications corresponding to the given needs. The focus of future research is the handling of large-scale data in a network environment and in real time. Virtual catalogs and the Z39.50 protocol are the focus of future study. Users wish for a comprehensive, updated, clear, consistent, and fast catalog, capable of carrying out searches across distributed databases in a heterogeneous network with consistency, accuracy and speed. Further research on conventional ways of duplicate record management, including the most current practices such as virtual merging, is needed. This research is important in order to fully understand a problem to which no satisfactory solutions have been found, while at the same time the need for such solutions is constantly increasing.

References
Campbell, C. (2006), Melvyl Project Coordinator, information given by e-mail (accessed 31 January 2006).
Cousins, S.A. (1998), "Duplicate detection and record consolidation in large bibliographic databases: the COPAC database experience", Journal of Information Science, Vol. 24 No. 4, pp. 231-40.
Cousins, S. (2006), COPAC Service, Manchester Computing, University of Manchester, available at: copac@mcc.ac.uk (accessed 11 January 2006).
Coyle, K. and Gallaher-Brown, L. (1985), "Record matching: an expert algorithm", ASIS Proceedings, Vol. 4 No. 1, pp. 77-80.
Coyle, K. (1992), Rules for Merging MELVYL Records, Technical Report No. 6, University of California, DLA, Oakland, CA.
Hickey, T.B. and Rypka, D.J. (1979), "Automatic detection of duplicate monographic records", Journal of Library Automation, Vol. 12 No. 2, pp. 125-42.
Hunstad, S. (1988), "Norwegian bibliographic databases and the problem of duplicate records", Cataloguing and Classification Quarterly, Vol. 8 Nos 3/4, pp. 239-48.
ILCSO (2004), Using OCLC for ILLINET Online/Voyager Data Entry, Illinois Library Computer Systems Office, available at: http://office.ilcso.illinois.edu/Docs/using_OCLC.pdf (accessed 15 February 2007).
Lazinger, S.S. (1994), "To merge and not to merge – Israel's Union List of Monographs in the context of merging algorithms", Information Technology and Libraries, Vol. 13 No. 3, pp. 213-9.
Meir, D.D. and Lazinger, S.S. (1998), "Measuring the performance of a merging algorithm: mismatches, missed-matches, and overlap in Israel's Union List", Information Technology and Libraries, Vol. 17 No. 3, pp. 116-23.
O'Neill, E. and Oskins, W.M. (1990), Duplicate Records in the Online Union Catalog, OCLC Office of Research, Dublin, OH.
Toney, S.R. (1992), "Cleanup and deduplication of an international bibliographic database", Information Technology and Libraries, Vol. 11 No. 1, pp. 19-28.
Vougiouklis, G. (2007), ELiDOC, available at: gvoug@elidoc.gr (accessed 2 February 2006).
Wanninger, P.D. (1982), "Is the OCLC database too large? A study of the effects of duplicate records in the OCLC system", Library Resources and Technical Services, Vol. 26, pp. 353-61.
Williams, M.E. and MacLaury, K.D. (1979), "Automatic merging of monographic data bases: identification of duplicate records in multiple files: the IUCS Scheme", Journal of Library Automation, Vol. 12 No. 2, pp. 156-68.

Corresponding author
Anestis Sitas can be contacted at: sitas@lit.auth.gr