The Code4Lib Journal, Issue 22, 2013-10-14

Thresholds for Discovery: EAD Tag Analysis in ArchiveGrid, and Implications for Discovery Systems

The ArchiveGrid discovery system is made up in part of an aggregation of EAD (Encoded Archival Description) encoded finding aids from hundreds of contributing institutions. In creating the ArchiveGrid discovery interface, the OCLC Research project team has long wrestled with what we can reasonably do with the large (120,000+) corpus of EAD documents. This paper presents an analysis of the EAD documents (the largest analysis of EAD documents to date). The analysis is paired with an evaluation of how well the documents support various aspects of online discovery. The paper also establishes a framework for thresholds of completeness and consistency to evaluate the results. We find that, while the EAD standard and encoding practices have not offered support for all aspects of online discovery, especially in a large and heterogeneous aggregation of EAD documents, current trends suggest that the evolution of the EAD standard and the shift from retrospective conversion to new shared tools for improved encoding hold real promise for the future.

By M. Bron, M. Proffitt and B. Washburn

Introduction

ArchiveGrid is an aggregation of nearly two million archival material descriptions, including MARC records from WorldCat and finding aids harvested from the web. It is supported by OCLC Research as a corpus for experimentation and testing in text mining, data analysis, and discovery system applications and interfaces. Archival collections held by thousands of libraries, museums, historical societies, and archives are represented in ArchiveGrid. Although roughly 90% of what is in ArchiveGrid is MARC records, as of April 2013 OCLC Research had harvested 124,009 EAD encoded finding aids for inclusion in ArchiveGrid [1]. This small segment of ArchiveGrid is important because EAD has been embraced by the archival community since its inception in the 1990s, and is supported by a range of tools designed specifically for archives, such as ArchivesSpace, Archivists' Toolkit, Archon, CALM, and others. In creating the ArchiveGrid discovery interface, the project team has wrestled with what we can reasonably do with this corpus. For example, it would be useful to be able to sort by size of collection; however, this would require some level of confidence that the relevant tag (<extent>, discussed below) is widely used and that its content lends itself to sorting. Other examples of desired functionality include providing a means in the interface to limit a search to items in a certain genre (for example, photographs) or to limit a search by date. Again, we would need to have confidence that the metadata we have will actually support these features, and not leave out potentially important collections simply because certain tags are absent. Specifically, we will consider how the variability of element use in finding aids affects discovery along five possible dimensions of a discovery system: search, browse, sort, limit, and display. As a warning to the reader: this paper delves deeply into EAD elements and attributes and assumes at least a passing knowledge of the encoding standard.
For those wishing to learn more about the definitions and structure, we recommend the official EAD website or the less official but highly readable and helpful EADiva site [2].

Related Work

The work most closely related to our research was done by Katherine M. Wisser and Jackie Dean [4]. In 2010 Wisser and Dean solicited EAD files from repositories in order to "identify encoding behavior" [3]. In total, 108 repositories submitted up to 15 finding aids for the analysis; 1,136 finding aids comprise the entire sample. The formal results of their analysis will be published in the Fall 2013 edition of American Archivist. We are grateful to the authors for sharing their early work with us, and note with interest that in many cases their analysis of EAD usage is quite similar to ours. However, in some notable cases the findings from the two samples diverge dramatically (see, for example, the elements in <archdesc> above the <dsc> in Table 9). As noted by Wisser and Dean, some of this variation can be attributed to the many different ways in which EAD files can be obtained. Wisser and Dean invited a limited contribution (12-15 finding aids) from a wide variety of repositories, including significant contributions from institutions outside of the US; even though Wisser and Dean carefully articulated that results would be anonymized, there is some chance that the results were somewhat skewed by the process of selecting files for inclusion. By contrast, our data set was assembled by harvesting EAD documents from institutions directly (see below). Contributing institutions have been motivated to contribute to ArchiveGrid primarily to share information about their collections, not their EAD practices. Additionally, ArchiveGrid is primarily constituted by repositories from the United States, with few institutions from Europe or elsewhere represented in the data set. Either or both of these key differences may account for divergence in findings between our work and that of Wisser and Dean.

The 2010 report "Implications of MARC Tag Usage on Library Metadata Practices" focused on an analysis of the MARC standard as reflected in WorldCat [5]. Although the emphasis of that report was, as with Wisser and Dean, to "inform community practice," a secondary purpose was to draw conclusions about the suitability of MARC data for machine matching and processing, which is similar to our desire to identify functionality (and gaps in functionality) in our current EAD corpus.

OCLC Research regularly harvests EAD documents from contributing institutions to update their representation in the ArchiveGrid index. The update cycle is roughly every six weeks. Institutions are contacted to obtain their permission to harvest and use the data in ArchiveGrid, and to identify the target URLs and rules for selection. For some contributors, the harvesting rules are simple: a directory listing or an HTML page is made available to our crawler, with every link leading to an EAD XML file on the contributor's server. For other contributors we may make use of a website designed for human visitors, applying custom include and exclude rules to the URLs we find in order to select only links to EAD documents. Though OAI-PMH repositories and other more specialized harvesting protocols may be available at some contributor sites, we have seen little interest among contributors in their use, and currently we use only standard HTTP GET requests for the many hundreds of EAD document providers.
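To make the harvesting workflow concrete, the following sketch shows the link-selection step for the simplest case described above, in which a contributor exposes a directory listing or browse page of links to EAD XML files. The listing URL, the include and exclude patterns, and the use of Python are illustrative assumptions; the actual rules are configured per contributor and are not described in detail here.

```python
# Minimal sketch of the link-selection step in EAD harvesting (hypothetical
# URL and patterns; real contributor rules vary and are maintained separately).
import re
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect href values from an HTML directory listing or browse page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def harvest_ead_links(listing_url, include=r"\.xml$", exclude=r"(index|template)"):
    """Fetch a listing page and return absolute URLs that look like EAD XML files."""
    with urllib.request.urlopen(listing_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    collector = LinkCollector()
    collector.feed(html)
    selected = []
    for href in collector.links:
        url = urljoin(listing_url, href)
        if re.search(include, url) and not re.search(exclude, url):
            selected.append(url)
    return selected

# Usage (hypothetical contributor listing page); each selected URL would then
# be fetched with a plain HTTP GET, as described above.
# for url in harvest_ead_links("https://example.edu/findingaids/"):
#     print(url)
```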
Maintaining the EAD harvesting operation continues to be a significant component of the ArchiveGrid support costs covered by OCLC Research.

Methods

Defining Thresholds

It is difficult to predefine thresholds for the level of usage of an element at which it becomes more or less useful for discovery. Is an element that is used 95% of the time useful, while one that is used 94% of the time is not? In this paper we consider the thresholds that resulted from working with our sample of documents. We use the terms "documents" and "finding aids" interchangeably throughout the paper. As an indicator of usage of an element we use the percentage of documents that contain the element at least once (% uniq). The nested nature of finding aids, however, influences the usage of elements, as the absence of a parent element reduces the possibility of the occurrence of its child elements. As an alternative indicator of usage we use the percentage of documents that contain an element within the sample of documents that contain the element's parent element (% uniq in C). Figure 1 shows how often the percentage of usage of an element falls into certain intervals. Note that we use relative usage (% uniq in C) here. The distribution of element usage can be roughly divided into four groups: (i) usage between 0%-50%, or low use; (ii) usage between 51%-80%, or medium use; (iii) usage between 81%-95%, or high use; and (iv) usage between 96%-100%, or complete use. Although we will use these levels as a reference point in this document, we do so with a recognition that correlating usage with discovery is an artificial construct. In the absence of a more effective approach, we are using these levels as an initial framework for discussion. The absence of an element does not directly lead to a breakdown in a discovery system; it is more like a gradual decay of the effectiveness of a discovery system. But not all elements are created equal: in current archival discovery systems we see a range of functionality offered, in terms of search and advanced search options as well as sorting features and results display. Within smaller aggregations, we might very well expect tag usage to be considerably more internally consistent than is the case in the ArchiveGrid aggregation. But in the case of ArchiveGrid and similar large aggregations of finding aids, what functionality can be reasonably supported, given the present state of the data? What functionality can we offer with assurance if we look only at elements that are in the high or complete categories?

Figure 1: The distribution of the percentage of element usage (% uniq in C). Elements are nested, and the absence of a parent element influences the occurrence percentage of a child element. For this reason we use the number of element occurrences relative to the occurrences of the parent element (% uniq in C).

Counting Element Occurrences

Finding aids follow the Encoded Archival Description standard, which is a complex XML structure. As an example of the complexity of EAD in implementation, we found more than 26,000 distinct paths in our 124,009 document set. To provide a starting point for obtaining element counts we recreated many (but not all) of the tables of element, attribute, and value counts presented in the report by Wisser and Dean [4]. Each table was recreated by performing one or more XPath queries over the corpus of finding aids. In the discussion of our analysis we do not follow the same structure as Wisser and Dean [4], as our focus is on the implications of element usage for discovery and presentation. Where appropriate, we report similarities and differences between element usage in our sample of finding aids and in theirs.
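To illustrate the counting approach before turning to the notation used in the tables, the sketch below computes N, N_uniq, % uniq, and % uniq in C for a single XPath expression over a directory of EAD files. The corpus location and the use of Python with lxml are assumptions made for illustration; they are not a description of the tooling actually used for this paper.

```python
# Sketch of per-element counting over a corpus of EAD XML files.
# The namespace-less paths mirror the queries listed with the Tables below;
# namespaced EAD documents would need adjusted expressions.
import glob
from lxml import etree

def count_element(files, element_xpath, parent_xpath):
    """Return (N, N_uniq, % uniq, % uniq in C) for one XPath expression."""
    n_total = 0        # N: total occurrences across all documents
    n_uniq = 0         # N_uniq: documents containing the element at least once
    n_parent_docs = 0  # documents containing the parent element (the sample C)
    for path in files:
        try:
            doc = etree.parse(path)
        except etree.XMLSyntaxError:
            continue  # skip documents that fail to parse
        hits = doc.xpath(element_xpath)
        parents = doc.xpath(parent_xpath)
        n_total += len(hits)
        if hits:
            n_uniq += 1
        if parents:
            n_parent_docs += 1
    pct_uniq = 100.0 * n_uniq / len(files) if files else 0.0
    pct_uniq_in_c = 100.0 * n_uniq / n_parent_docs if n_parent_docs else 0.0
    return n_total, n_uniq, pct_uniq, pct_uniq_in_c

# Example: usage of unitdate within the high-level did (cf. Table 7).
files = glob.glob("ead_corpus/*.xml")  # hypothetical corpus location
print(count_element(files, "/ead/archdesc/did/unitdate", "/ead/archdesc/did"))
```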
In the rest of the paper we use the following notation in our tables: (i) N is the total number of occurrences of an element; (ii) N_uniq is the number of documents in which the element occurs at least once; (iii) % [N_uniq/S] is the percentage of documents in our sample of EAD documents (S = 124,009) that contain the element at least once; and (iv) % [N_uniq/n] is the percentage of documents that contain the element within the sample of documents (n = ...) that contain a certain element. We provide the size of each particular sample explicitly. For example, when considering an element that occurs in every document, such as eadheader, we get % [N_uniq/S] = 100.00, which is the same as % [N_uniq/n=124009]. We use % [N_uniqK/n] to indicate the percentage of documents that contain the element in the corresponding sample of documents collected by Wisser and Dean. In most cases the sample size will be all documents in their sample, i.e., n = 1,136. Finally, we use diff to indicate the percentage point difference between % [N_uniq/n] and % [N_uniqK/n], i.e., between our sample and Wisser and Dean's (our percentage minus theirs).

Dimensions for Analysis

Our analysis considered the following dimensions:

search: all discovery systems have a keyword search function; many also include the ability to search by a particular field or element. Examples include name, date, and subject.

browse: many discovery systems include the ability to browse finding aids. Examples include browse by repository and browse by material type.

results display: once a user has done a search, the results display returns portions of the finding aid to help with further evaluation. Examples include title, dates, and collection size.

sort: once a user has done a search, they may have the option to reorder the results. Examples include order by date, order by title, and order by size.

limit by: once a user has done a search, they may have the option to narrow the results to only those that meet certain criteria. This may be done through the presentation of facets. Examples include limit to collections with digital material and limit by repository.

Current discovery interfaces

We reviewed a number of different discovery interfaces for finding aids in order to provide an overview of the type of search, browse, sort, limit, and display options that are generally available. The interfaces included are: the Online Archive of California (http://www.oac.cdlib.org/), the Northwest Digital Archive (http://nwda.orbiscascade.org/), Texas Archival Resources Online (http://www.lib.utexas.edu/taro/index.html), Arizona Archives Online (http://www.azarchivesonline.org/xtf/search), the Five Colleges Archives and Manuscripts Collection (http://asteria.fivecolleges.edu/index.html), the Rocky Mountain Online Archive (http://rmoa.unm.edu/), and the Harvard Library's Online Archival Search Information System (http://oasis.lib.harvard.edu/oasis/deliver/home?_collection=oasis). The interfaces we surveyed are very traditional in the capabilities they support; this is no doubt in part an outcome of the type of functionality that is supported in EAD 2002. In addition to assessing the suitability of the ArchiveGrid corpus for some general archival-specific discovery interfaces, we wanted to cast our net a little wider and speculate on how well EAD may meet the needs of emerging NextGen (or NowGen!)
approaches to discovery that may not be represented in the interfaces we surveyed, or supported by 2002-era EAD. Emerging discovery approaches include support for geo-locating archival locations, subjects of collected materials, and other elements, to serve map-based search interfaces. Examples of map-based discovery interfaces include HistoryPin (http://www.historypin.com/), WhatWasThere (http://www.whatwasthere.com/), and Historvius (http://www.historvius.com/). Similarly, we see support for event-based retrieval, using timelines or similar devices, as an area in which discovery systems are evolving. Some examples include: the SIMILE example project timeline for Jewish history (http://simile.mit.edu/timeline/examples/religions/jewish-history.html), the timeline view of the Philippine Archives Collection at NARA (http://www.archives.gov/research/military/ww2/philippine/timeline.html), and the Zagora Archaeological Project (http://www.powerhousemuseum.com/zagora/timeline/).

Analysis Details

We now take a closer look at which elements might drive each function, how well the aggregated data fits this purpose in terms of meeting our thresholds, and how well the content of key elements is fit for purpose. With each element, we've included a note about how it is used in ArchiveGrid and in other discovery systems.

Date

Our analysis shows use of <unitdate> within the high-level <did> as medium (72.64%; see Table 7). This makes <unitdate> values less than reliable for functions such as sort and limit by. Consider, for example, a scenario where a researcher is interested in material from the Second World War. Filtering by a date range of 1939-1945 will present only those documents that have a <unitdate> assigned in that period, and may lead to the researcher missing potentially relevant documents. Alternatively, only those documents could be excluded that have a date outside of the indicated range. However, with a large number of EAD documents missing a <unitdate> field, this approach defeats the purpose of filtering. Investing effort to bring this element closer to high or complete may be warranted; however, to support dimensions beyond just display, the content of the field or the contents of its "normal" attribute must be easily parseable. When we look at the content of <unitdate>, we find a wide range of descriptive practices, some of which could pose problems for machine parsing to support use in indexing and retrieval. Another issue involved in using the field is that it can be used in several places, e.g., on its own in the top-level <did> or as a subelement of <unittitle>. Comparing the usage of <unitdate> in our collection of EAD documents and that of Wisser and Dean, we find that it is one of the elements where we see the greatest divergence: their sample shows a usage of <unitdate> in the <did> of 97.00%.

In ArchiveGrid, dates are used in:
search: they are keyword searchable
display: with the collection title (when available) in brief displays

In other Archival Discovery Systems: search, browse, sort, display.
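To make the parseability concern concrete, the sketch below shows one way a discovery system might test whether a unitdate overlaps a requested range, relying on the "normal" attribute when it follows simple YYYY or YYYY/YYYY patterns. The handled formats and the example values are assumptions for illustration; the corpus contains many other forms of date expression, which is precisely the problem described above.

```python
# Hedged sketch: filtering by date range using unitdate "normal" values.
# Only simple "YYYY" and "YYYY/YYYY" (optionally with months/days) forms are
# handled; anything else is treated as unparseable, which is exactly the gap
# discussed above.
import re

NORMAL_PATTERN = re.compile(
    r"^(\d{4})(?:-\d{2})?(?:-\d{2})?(?:/(\d{4})(?:-\d{2})?(?:-\d{2})?)?$"
)

def parse_normal(normal):
    """Return (start_year, end_year) from a unitdate normal value, or None."""
    match = NORMAL_PATTERN.match(normal.strip())
    if not match:
        return None
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    return start, end

def overlaps(normal, query_start, query_end):
    """True if the encoded range overlaps the query range (e.g., 1939-1945)."""
    parsed = parse_normal(normal)
    if parsed is None:
        return None  # unknown: a strict filter would silently drop this document
    start, end = parsed
    return start <= query_end and end >= query_start

# Illustrative values (not drawn from the corpus):
print(overlaps("1939/1945", 1939, 1945))    # True
print(overlaps("1923", 1939, 1945))         # False
print(overlaps("circa 1940s", 1939, 1945))  # None: unparseable free text
```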
Extent

Our analysis shows use of <extent> within the high-level <physdesc> as medium (70.43%; see Table 8). As with <unitdate>, the content of <extent> is quite varied and does not easily facilitate sorting, with values ranging from "miscellaneous artifacts" to "2 ceramic heads." The syntax of the element (with attributes for @encodinganalog, @type, and @unit) does not currently lend itself to structuring data in a way that can be used for sorting without clear guidelines, tools to enforce appropriate encoding, and rigor on the part of institutions; retrospectively refitting <extent> to be usable for sorting could be a daunting challenge for many institutions (a sketch of the parsing problem follows below). Many documents in the ArchiveGrid corpus have multiple <extent> statements, further complicating matters, as the system would need to decide which one to sort on, for example. For display, including <extent> statements to help researchers evaluate results seems fit to purpose.

In ArchiveGrid, extent is used in:
search: extent values are keyword searchable
display: presented in brief displays and separately in the display of individual collection descriptions

In other Archival Discovery Systems: sort, display.
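The sketch below, referred to above, illustrates why sorting on raw extent statements is fragile: it attempts to pull a leading quantity and unit out of the text and gives up otherwise. The regular expression and the sample values beyond those quoted above are illustrative assumptions, not a description of ArchiveGrid's behavior.

```python
# Hedged sketch: deriving a sort key from free-text extent statements.
# Values that do not start with a number (e.g., "miscellaneous artifacts")
# cannot be ordered meaningfully and fall to the end of any sort.
import re

EXTENT_PATTERN = re.compile(r"^\s*([\d.,]+)\s*(\w[\w\s.]*)")

def extent_sort_key(extent_text):
    """Return (unparseable_flag, quantity, unit) usable as a crude sort key."""
    match = EXTENT_PATTERN.match(extent_text or "")
    if not match:
        return (1, 0.0, "")  # unparseable: sorts after parseable values
    quantity = float(match.group(1).replace(",", ""))
    unit = match.group(2).strip().lower()
    return (0, quantity, unit)

statements = ["2 ceramic heads", "15.5 linear feet", "miscellaneous artifacts"]
print(sorted(statements, key=extent_sort_key))
# ['2 ceramic heads', '15.5 linear feet', 'miscellaneous artifacts']
# Even the "parseable" values mix incomparable units, so a real system would
# still need shared encoding guidelines (for example, a consistent @unit).
```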
Collection Title

Our analysis shows use of <unittitle> in the high-level <did> as complete (99.93%; see Table 7); this would suggest that it is suitable for all uses. However, for sorting and browsing, again, utility depends on the content of the element. If the content of the element is something generic like "Records" or "Papers" (in cases where perhaps the creator has been recorded separately in the origination element), then all functions may be less than ideal, particularly sorting by title or creating browse lists. Many interfaces construct browse lists of collection titles, allow users to sort results by title, or search within titles. Not surprisingly, we found the required <filedesc> element in the <eadheader> to be complete. Although our analysis did not include elements below <filedesc>, we can assume that the required <titlestmt> and its required child, <titleproper>, will be similarly complete. The fact that the title is fully populated is good news for searching and display; however, for sorting and constructing browse lists, we would need some assurance that the contents of <titleproper> are fit to purpose. This would be an opportunity for further evaluation, although a quick scan of the contents of <titleproper> encouragingly revealed that 42% of ArchiveGrid finding aids have a @type attribute with the value "filing", which is rather remarkable as there is no specified list of values for @type.

In ArchiveGrid, collection titles are used in:
search: they are keyword searchable
display: collection titles appear in brief search results

In other Archival Discovery Systems: sort, browse, display.

Subject

Our analysis shows use of <controlaccess> as medium (72.89%; see Table 9). <controlaccess> is the parent element of <subject> as well as of other access points (such as personal, corporate, and geographic names, and genre/form terms). Our analysis did not include drilling down to the use of these subelements. (Given differences in library and archival practices, we would expect control of form and genre terms to be relatively high, and control of names and subjects to be relatively low.)

In ArchiveGrid, subjects are used in:
limit by: we show values for people, groups, places, and topics as Result Overview facets for limiting a search result

In other Archival Discovery Systems: search, browse.

Material type

Researchers may wish to limit to or seek out material in a specific format, and our survey of discovery systems reveals that some systems support this functionality. Our analysis did not include the children of <controlaccess>, which include <genreform>.

In ArchiveGrid, material type is used for:
search: material types in <genreform> are keyword searchable

In other Archival Discovery Systems: search, browse, limit by.

Names (personal or corporate)

Names can be found in multiple places. For the creator of a collection, the name is most logically found in <origination>, where both <persname> and <corpname> are child elements. The use of the origination tag is high (87.78%; see Table 7); our analysis did not include evaluation of the use of <persname> and <corpname> within origination. Otherwise, personal and corporate names as access points may be found in <controlaccess> (see above). Name elements occur ubiquitously in EAD 2002, and our analysis did not include a detailed inventory of <persname> and <corpname> in the many places they can occur. A weakness of the distributed nature of names throughout EAD documents is that, without detailed annotations and co-references, discovery systems have only a shallow understanding of names and their relationships to the collection and to one another. Discovery systems are not always able to differentiate between names used in a creator context and those covered in the description, which may show up as access points.

In ArchiveGrid, names are used for:
search: names are keyword searchable
limit by: names for people, groups, and places appear in the Result Overview

In other Archival Discovery Systems: search, limit by.

Repository

The name of the repository is found in <repository> in the high-level <did>. Use of this element falls into the promising complete category (99.46%; see Table 7). However, a variety of practice is in play, with the name of the repository being embellished with additional tags (for example, <corpname> and <address>) nested within <repository>.
To avoid the difficulties in handling these variations across a range of contributing institutions, ArchiveGrid maintains a separate system to manage the form of the institution name for use in the system. In ArchiveGrid, <repository> is not used as an access point, though ArchiveGrid's separately administered and controlled form of the repository name is used for search, browse, sort, limit, and display.

In other Archival Discovery Systems: browse, limit by.

Scope note, biographical note, abstract

Our analysis shows use of <scopecontent> as high (84.41%; see Table 9), while <bioghist> (70.42%; see Table 9) and <abstract> (79.20%; see Table 7) are medium. All three are suitable for search and for display in a results view, although they can be quite lengthy. For search, it's worth noting that the semantics of these elements differ, which may result in unexpected and false "relevance" for matches against descriptions in <bioghist> (about the person) and in <scopecontent> and <abstract> (which may be more about the collection).

In ArchiveGrid, these notes are used in:
search: notes are keyword searchable
display: notes appear (in truncated form if lengthy) in brief search results

In other Archival Discovery Systems: search, display (in snippets or in their entirety).

Collections with digital content

Our analysis did not explore the use of <dao> or <daogrp> elements, which can be used in a variety of places in EAD 2002. Wisser and Dean found that <dao> and <daogrp> are used in only 7.7% and 9.3% of the documents in their sample, putting both into the low category (see Wisser and Dean, Table 26). However, with growing interest in digitized materials from archival collections, identifying those materials is of increasing importance.

In ArchiveGrid, we provide no mechanism for searching or identifying collections with digital content.

In other Archival Discovery Systems: limiting results to those with digital content; flagging collections with digital content.

Future Work

In order to make EAD-encoded finding aids better suited for use in discovery systems, the population of key elements will need to be moved closer to high or (ideally) complete. However, it is not only a matter of populating the elements, but of ensuring that the data will reliably power key aspects of discovery systems. This will take concerted effort and tools, on the part of both individual institutions and groups. In the analysis of "NextGen" discovery services, we noted the use of geolocation-based discovery. Although we would need to do further analysis to assess the usage of <geogname> in our document set, the current structure of the element does not support geolocation functionality. However, as part of the redesign for EAD3, EAD is becoming more supportive of linked data and linked data structures. This may offer some hope for retrofitting EAD data to be better suited to the task of meeting map-based discovery requirements. Likewise, the data we have on hand does not suggest good support for event-based discovery, which would draw on well-structured dates, geographic subject terms, and topical subject terms (such as "Battle of Alma" or "Great Depression"). Again, EAD 2002 does not support the sort of encoding that would be necessary to serve event-based discovery, but EAD3 may provide more appropriate structures.
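As a purely illustrative sketch of the geolocation gap noted above, the code below collects geogname values from a single finding aid and hands them to a geocoding step. Because neither EAD 2002 nor the corpus provides coordinates, the geocoder here is only a placeholder for some external service, and the file path is hypothetical.

```python
# Hedged sketch: harvesting geogname values for a map-based interface.
# EAD 2002 carries no coordinates, so a separate (hypothetical) geocoding
# service would have to resolve each place name; results would be noisy.
from lxml import etree

def collect_geognames(ead_path):
    """Return the distinct geogname strings found anywhere in one finding aid."""
    doc = etree.parse(ead_path)
    names = {" ".join(node.itertext()).strip() for node in doc.xpath("//geogname")}
    return sorted(name for name in names if name)

def geocode(place_name):
    """Placeholder: a real system would call an external geocoding service."""
    raise NotImplementedError("geocoding service integration is out of scope")

# Usage (hypothetical file):
# for place in collect_geognames("ead_corpus/example.xml"):
#     print(place, geocode(place))
```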
An Optimum Threshold for Discovery?

The picture for archival discovery and EAD is decidedly mixed. On the one hand, we have elements that are in high or even complete use. On the other hand, we have many elements that are necessary for discovery interfaces but are only in medium use; and even with elements that are in high or complete use, the contents of those tags are not always fit to purpose. This can be at least partly explained by EAD's history. In the early days of EAD the focus was largely on moving finding aids from typescript to SGML and XML. Even with much attention given over to the development of institutional and consortial best practice guidelines and requirements, much work was done by brute force and often with little attention given to (or funds allocated for) making the data fit to the purpose of discovery. Tag analyses such as the work described in this paper can help inform the development and implementation of the EAD schema (indeed, the work done by Wisser and Dean was considered in the development of EAD3). But our analysis suggests that the standard already has most of the elements and attributes needed to effectively support discovery; what's missing is agreement on, and widespread application of, best practices tied to supporting discovery.

So, is the container list half empty? If the archival community continues on its current path, then the potential of the EAD format to support researchers or the public in discovery of material will remain underutilized. At a minimum, collection descriptions that fall below the thresholds for discovery will hinder researchers' discovery efforts; at worst, those collections will remain hidden from view. Our paper provides suggestions for the elements where additional effort and investment are warranted to improve their utility for discovery systems. (We recognize that for some institutions that additional effort may not be feasible or warranted; for their purposes they may find that HTML or PDF collection descriptions suffice.)

Or is the container list half full? Perhaps with emerging evidence about the corpus of EAD, continued discussion of practice, recognition of a need for greater functionality, and shared tools both to create new EAD documents and to improve existing encoding, we can look forward to further increasing the effectiveness and efficiency of EAD encoding, and develop a practice of EAD encoding that pushes collection descriptions across the threshold of discovery.

Tables

Table 1: (Wisser Table 1): General statistics for EAD finding aids, using queries: /ead/*.
Element N N_uniq % [N_uniq/S] % [N_uniq/n=124009] % [(N_uniqK)/n=1136] diff
eadheader 124009 124009 100.00 100.00 100.00 0.00
archdesc 124009 124009 100.00 100.00 100.00 0.00
frontmatter 46115 46115 37.19 37.19 24.60 12.59
eadgrp 0 0 0.00 0.00 0.00 0.00
archdescgrp 0 0 0.00 0.00 0.00 0.00
dscgrp 0 0 0.00 0.00 0.00 0.00

Table 2: (Wisser Table 2): Elements used within eadheader, using query /ead/eadheader/*.
Element N N_uniq % [N_uniq/S] % [N_uniq/n=124009] % [(N_uniqK)/n=1136] diff
eadid 124445 124008 100.00 100.00 100.00 -0.00
filedesc 124009 124009 100.00 100.00 100.00 0.00
profiledesc 123103 123103 99.27 99.27 98.10 1.17
revisiondesc 42504 42501 34.27 34.27 32.70 1.57

Table 3: (Wisser Table 3): Attributes used with eadheader, using query //eadheader.
Attribute N N_uniq % [N_uniq/S] % [N_uniq/n=124009] % [(N_uniqK)/n=1136] diff
countryencoding 107412 107412 86.62 86.62 89.50 -2.88
dateencoding 107377 107377 86.59 86.59 88.20 -1.61
findaidstatus 42910 42910 34.60 34.60 27.80 6.80
langencoding 117641 117641 94.86 94.86 95.00 -0.14
repositoryencoding 106370 106370 85.78 85.78 87.80 -2.02
scriptencoding 95230 95230 76.79 76.79 77.60 -0.81

Table 4: (Wisser Table 4): Attributes used with eadid, using query //eadid.
Attribute N N_uniq % [N_uniq/S] % [N_uniq/n=124009] % [(N_uniqK)/n=1136] diff
countrycode 108668 108667 87.63 87.63 94.30 -6.67
mainagencycode 105351 105350 84.95 84.95 92.60 -7.65
publicid 45758 45758 36.90 36.90 31.10 5.80
url 38020 38020 30.66 30.66 42.30 -11.64
urn 2312 2312 1.86 1.86 3.90 -2.04
identifier 57260 57260 46.17 46.17 49.30 -3.13

Table 5: (Wisser Table 8): Elements within frontmatter, using query /ead/frontmatter/*.
Element N N_uniq % [N_uniq/S] % [N_uniq/n=46115] % [(N_uniqK)/n=279] diff
titlepage 45726 45726 36.87 99.16 92.80 6.36
div 190 190 0.15 0.41 2.20 -1.79

Table 6: (Wisser Table 9): Values for @level within archdesc, using query //archdesc/@level.
Value N N_uniq % [N_uniq/S] % [N_uniq/n=124009] % [(N_uniqK)/n=1,136] diff
collection 116957 116957 94.31 94.31 90.90 3.41
fonds 135 135 0.11 0.11 4.80 -4.69
class 9 9 0.01 0.01 0.30 -0.29
recordgrp 433 433 0.35 0.35 1.40 -1.05
series 2394 2394 1.93 1.93 0.60 1.33
subfonds 49 49 0.04 0.04 0.30 -0.26
subgrp 526 526 0.42 0.42 1.00 -0.58
subseries 46 46 0.04 0.04 0.00 0.04
file 2446 2446 1.97 1.97 0.40 1.57
item 987 987 0.80 0.80 0.30 0.50
otherlevel 25 25 0.02 0.02 0.10 -0.08

Table 7: (Wisser Table 10): Elements within archdesc/did, using query /ead/archdesc/did/*.
Element N N_uniq % [N_uniq/S] % [N_uniq/n=124009] % [(N_uniqK)/n=1,136] diff
abstract 102792 98218 79.20 79.20 86.60 -7.40
container 5447 3471 2.80 2.80 0.40 2.40
langmaterial 112938 109232 88.08 88.08 89.90 -1.82
materialspec 41 41 0.03 0.03 1.60 -1.57
origination 113684 108853 87.78 87.78 89.00 -1.22
physdesc 135126 122402 98.70 98.70 97.20 1.50
physloc 53564 45620 36.79 36.79 27.80 8.99
repository 123343 123330 99.45 99.45 99.60 -0.15
unitdate 97247 90080 72.64 72.64 97.00 -24.36
unitid 119911 114898 92.65 92.65 90.10 2.55
unittitle 123959 123916 99.93 99.93 100.00 -0.07

Table 8: (Wisser Table 11): Elements within archdesc/did/physdesc, using query /ead/archdesc/did/physdesc/*.
Element N N_uniq % [N_uniq/S] % [N_uniq/n=124009] % [(N_uniqK)/n=1,136] diff
dimensions 666 576 0.46 0.46 1.80 -1.34
extent 122613 87339 70.43 70.43 76.30 -5.87
physfacet 2000 1520 1.23 1.23 1.70 -0.47

Table 9: (Wisser Table 12): Elements within archdesc, above the dsc, using query /ead/archdesc/*.
Element N N_uniq % [N_uniq/S] % [N_uniq/n=124009] % [(N_uniqK)/n=1,136] diff
accessrestrict 55751 55579 44.82 44.82 86.20 -41.38
accruals 694 694 0.56 0.56 7.10 -6.54
acqinfo 40668 40451 32.62 32.62 68.00 -35.38
altformavail 2293 2289 1.85 1.85 12.70 -10.85
appraisal 4613 4602 3.71 3.71 4.80 -1.09
arrangement 40979 40627 32.76 32.76 65.50 -32.74
bibliography 4573 4083 3.29 3.29 10.10 -6.81
bioghist 89103 87333 70.42 70.42 87.30 -16.88
controlaccess 92124 90390 72.89 72.89 85.00 -12.11
custodhist 8375 8366 6.75 6.75 14.10 -7.35
descgrp 67684 56446 45.52 45.52 32.00 13.52
fileplan 50 44 0.04 0.04 0.60 -0.56
index 1231 656 0.53 0.53 1.20 -0.67
odd 9594 8145 6.57 6.57 9.70 -3.13
originalsloc 988 973 0.78 0.78 3.40 -2.62
otherfindaid 6529 6271 5.06 5.06 11.90 -6.84
phystech 900 897 0.72 0.72 4.20 -3.48
prefercite 49015 48989 39.50 39.50 85.40 -45.90
processinfo 27249 26623 21.47 21.47 0.00 21.47
relatedmaterial 23932 23676 19.09 19.09 40.30 -21.21
runner 10822 10822 8.73 8.73 1.10 7.63
scopecontent 105384 104670 84.41 84.41 93.40 -8.99
separatedmaterial 5789 5691 4.59 4.59 14.80 -10.21
userestrict 41365 40749 32.86 32.86 68.30 -35.44

Table 10: (Wisser Table 13): The inclusion of dsc in finding aids, using query //dsc.
Element N N_uniq % [N_uniq/S] % [N_uniq/n=124009] % [(N_uniqK)/n=1,136] diff
dsc 98663 94473 76.18 76.18 90.30 -14.12
multiple dsc 98663 2075 1.67 1.67 2.40 -0.73

Table 11: (Wisser Table 14): dsc type attributes, using query //dsc/@type.
Value N N_uniq % [N_uniq/S] % [N_uniq/n=99023] % [(N_uniqK)/n=1,105] diff
analyticover 3156 3149 2.54 3.18 5.10 -1.92
combined 49205 49184 39.66 49.67 66.50 -16.83
in-depth 36433 35876 28.93 36.23 16.70 19.53
othertype 1725 1572 1.27 1.59 3.50 -1.91

Table 12: (Wisser Table 15): c-c12 tags, using query //c | //c01 | //c02 | //c03 | //c04 | //c05 | //c06 | //c07 | //c08 | //c09 | //c10 | //c11 | //c12.
Element N N_uniq % [N_uniq/S] % [N_uniq/n=96548] % [(N_uniqK)/n=1,053] diff
c 4745698 14440 11.64 14.96 11.10 3.86
c01 1650659 78600 63.38 81.41 88.00 -6.59
c02 7432993 59217 47.75 61.33 72.50 -11.17
c03 6625963 29136 23.50 30.18 41.80 -11.62
c04 2927180 12819 10.34 13.28 20.60 -7.32
c05 1312217 5587 4.51 5.79 10.70 -4.91
c06 598647 2266 1.83 2.35 4.60 -2.25
c07 261648 922 0.74 0.95 2.00 -1.05
c08 90401 331 0.27 0.34 0.70 -0.36
c09 21514 110 0.09 0.11 0.30 -0.19
c10 3578 36 0.03 0.04 0.10 -0.06
c11 823 7 0.01 0.01 0.00 0.01
c12 96 2 0.00 0.00 0.00 0.00

Table 13: (Wisser Table 16): Values for the level attribute on c, c/@level, using query //c/@level | //c01/@level | //c02/@level | //c03/@level | //c04/@level | //c05/@level | //c06/@level | //c07/@level | //c08/@level | //c09/@level | //c10/@level | //c11/@level | //c12/@level.
Value N N_uniq % [N_uniq/S] % [N_uniq/n=96548] % [(N_uniqK)/n=1,053] diff
collection 13489 4782 3.86 4.95 2.10 2.85
fonds 418 95 0.08 0.10 0.70 -0.60
class 63134 2113 1.70 2.19 1.20 0.99
recordgrp 1535 193 0.16 0.20 0.70 -0.50
series 398727 58480 47.16 60.57 77.70 -17.13
subfonds 3210 637 0.51 0.66 1.70 -1.04
subgrp 5573 430 0.35 0.45 3.10 -2.65
subseries 466366 16974 13.69 17.58 35.30 -17.72
file 11419524 36262 29.24 37.56 56.90 -19.34
item 3480272 20415 16.46 21.14 24.20 -3.06
otherlevel 368942 6225 5.02 6.45 9.10 -2.65

Table 14: (Wisser Table 17): c-c12/did elements, using query //c/did/* | //c01/did/* | //c02/did/* | //c03/did/* | //c04/did/* | //c05/did/* | //c06/did/* | //c07/did/* | //c08/did/* | //c09/did/* | //c10/did/* | //c11/did/* | //c12/did/*.
Element N N_uniq % [N_uniq/S] % [N_uniq/n=96548] % [(N_uniqK)/n=1,053] diff
abstract 1421043 3850 3.10 3.99 2.50 1.49
container 24951558 72377 58.36 74.96 82.50 -7.54
langmaterial 46798 1127 0.91 1.17 6.10 -4.93
materialspec 22870 106 0.09 0.11 1.30 -1.19
origination 1308346 4090 3.30 4.24 8.10 -3.86
physdesc 3967094 37749 30.44 39.10 54.40 -15.30
physloc 1343791 5978 4.82 6.19 5.80 0.39
repository 34923 29 0.02 0.03 0.30 -0.27
unitdate 9613593 41894 33.78 43.39 90.60 -47.21
unitid 7167784 31035 25.03 32.14 46.20 -14.06
unittitle 25228059 92888 74.90 96.21 98.90 -2.69

Table 15: (Wisser Table 18): c-c12/did/physdesc elements, using query //c/did/physdesc/* | //c01/did/physdesc/* | //c02/did/physdesc/* | //c03/did/physdesc/* | //c04/did/physdesc/* | //c05/did/physdesc/* | //c06/did/physdesc/* | //c07/did/physdesc/* | //c08/did/physdesc/* | //c09/did/physdesc/* | //c10/did/physdesc/* | //c11/did/physdesc/* | //c12/did/physdesc/*.
Element N N_uniq % [N_uniq/S] % [N_uniq/n=96548] % [(N_uniqK)/n=1,053] diff
dimensions 144079 1378 1.11 1.43 5.20 -3.77
extent 2401903 24495 19.75 25.37 36.60 -11.23
physfacet 164430 613 0.49 0.63 6.80 -6.17

Table 16: (Wisser Table 19): other elements found in c-c12, using query //c/* | //c01/* | //c02/* | //c03/* | //c04/* | //c05/* | //c06/* | //c07/* | //c08/* | //c09/* | //c10/* | //c11/* | //c12/*.
Element N N_uniq % [N_uniq/S] % [N_uniq/n=96548] % [(N_uniqK)/n=1,053] diff
accessrestrict 600069 4844 3.91 5.02 10.70 -5.68
accruals 12 11 0.01 0.01 0.00 0.01
acqinfo 68066 1477 1.19 1.53 4.50 -2.97
altformavail 252282 766 0.62 0.79 2.70 -1.91
appraisal 48 30 0.02 0.03 0.70 -0.67
arrangement 31945 5746 4.63 5.95 19.00 -13.05
bibliography 2067 48 0.04 0.05 1.50 -1.45
bioghist 12511 1132 0.91 1.17 4.60 -3.43
controlaccess 243134 2149 1.73 2.23 5.10 -2.87
custodhist 26224 181 0.15 0.19 2.20 -2.01
descgrp 2703 31 0.02 0.03 1.80 -1.77
index 386148 835 0.67 0.86 0.70 0.16
note 1180397 11265 9.08 11.67 20.30 -8.63
odd 242182 2663 2.15 2.76 7.20 -4.44
originalsloc 9959 211 0.17 0.22 1.00 -0.78
otherfindaid 1945 247 0.20 0.26 2.30 -2.04
phystech 8439 300 0.24 0.31 1.50 -1.19
prefercite 1995 264 0.21 0.27 0.10 0.17
processinfo 26332 1084 0.87 1.12 3.80 -2.68
relatedmaterial 16727 882 0.71 0.91 4.40 -3.49
runner 0 0 0.00 0.00 0.00 0.00
scopecontent 1852092 33483 27.00 34.68 61.30 -26.62
separatedmaterial 2784 208 0.17 0.22 0.00 0.22
userestrict 2993 580 0.47 0.60 3.20 -2.60

Table 17: (Wisser Table 20): content tags in dsc, using query //dsc//*.
Element N N_uniq % [N_uniq/S] % [N_uniq/n=96548] % [(N_uniqK)/n=1,053] diff
corpname 373402 6082 4.90 6.30 8.40 -2.10
famname 3644 914 0.74 0.95 1.70 -0.75
function 996 53 0.04 0.05 0.00 0.05
genreform 351956 6988 5.64 7.24 5.10 2.14
geogname 1023771 6653 5.36 6.89 6.30 0.59
name 34339 380 0.31 0.39 1.40 -1.01
occupation 25284 285 0.23 0.30 0.40 -0.10
persname 2610548 11970 9.65 12.40 12.90 -0.50
subject 1239139 2419 1.95 2.51 4.70 -2.19

References

[1] In April 2013, the ArchiveGrid index contained 1,632,246 MARC records, 119,984 EAD records, 61,551 HTML records, and 4,532 PDF records. The EAD count in the index is lower than the set of documents analyzed, to avoid duplicating their display for certain contributors who supply corresponding MARC records.

[2] Library of Congress EAD website: http://www.loc.gov/ead/index.html; EADiva: http://eadiva.com/.

[3] E-mail to the Archives and Archivists listserv, November 15, 2010.
[4] Wisser, Katherine M., and Jackie Dean. EAD Tag Usage: Community Analysis of the Use of Encoded Archival Description Elements. Article submitted for publication in American Archivist.

[5] Smith-Yoshimura, Karen, Catherine Argus, Timothy J. Dickey, Chew Chiat Naun, Lisa Rowlinson de Ortiz, and Hugh Taylor. 2010. Implications of MARC Tag Usage on Library Metadata Practices.

About the Authors

Marc Bron is a researcher at the Intelligent Systems Lab Amsterdam, where he is about to complete his PhD in Information Retrieval. His dissertation focused on improving accessibility to information stored in cultural heritage institutions by developing algorithms and interactive retrieval systems that support exploration and contextualization. During his PhD, Marc published over 20 papers at top-tier conferences, journals, and workshops. His current research direction aims to develop new collaborative search methods for users of archival collections.

Bruce Washburn is a Consulting Software Engineer in OCLC Research. He provides software development support for OCLC Research initiatives and participates as a contributing team member on selected research projects. In addition, he provides software development support for selected OCLC products and services. At OCLC, Washburn has been part of the product teams that developed and maintain CAMIO, ArchiveGrid, the WorldCat Search API, and OAIster.

Merrilee Proffitt is a Senior Program Officer in OCLC Research. She provides project management skills and expert support to institutions represented within the OCLC Research Library Partnership. Merrilee has authored or co-authored articles, guidelines, and reports for a variety of organizations and professional journals. She is frequently an invited speaker at international professional conferences and workshops on topics relating to digital libraries and special collections. Her current projects and interests include archival description, increasing access to special collections, developing better relationships between Wikipedia and cultural heritage institutions, and how Massive Open Online Courses (MOOCs) may impact libraries. She is a member of the small but mighty ArchiveGrid team.

This work is licensed under a Creative Commons Attribution 3.0 United States License.