Metadata Provenance and Vulnerability

Timothy Robert Hart and Denise de Vries

INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2017

Timothy Robert Hart (tim.hart@flinders.edu.au) is a PhD researcher and Denise de Vries (denise.devries@flinders.edu.au) is Lecturer of Computer Science, College of Science and Engineering, Flinders University, Adelaide, Australia.

ABSTRACT

The preservation of digital objects has become an urgent task in recent years as it has been realised that digital media have a short life span. The pace of technological change makes accessing these media increasingly difficult. Digital preservation is primarily accomplished by two main methods, migration and emulation. Migration has been proven to be a lossy method for many types of digital objects. Emulation is much more complex; however, it allows preserved digital objects to be rendered in their original format, which is especially important for complex types such as those comprising multiple dynamic files. Both methods rely on good metadata to maintain change history or construct an accurate representation of the required system environment. In this paper, we present findings that show the vulnerability of metadata and how easily they can be lost and corrupted by everyday use. Furthermore, this paper aspires to raise awareness and to emphasise the necessity of caution and expertise when handling digital data by highlighting the importance of provenance metadata.

INTRODUCTION

UNESCO recognised digital heritage in its “Charter on the Preservation of Digital Heritage,” adopted in 2003, stating, “The digital heritage consists of unique resources of human knowledge and expression. It embraces cultural, educational, scientific and administrative resources, as well as technical, legal, medical and other kinds of information created digitally, or converted into digital form from existing analogue resources.
Where resources are ‘born digital’, there is no other format but the digital object.” 1

Born-digital objects are at risk of degradation, corruption, loss of data, and becoming inaccessible. We combat this through digital preservation to ensure they remain accessible and useable. The two main approaches to preservation are migration and emulation. Migration involves converting digital objects to a different, currently supported file type. Emulation involves replicating a digital environment in which the digital object can be accessed in its original format. Both methods have advantages and disadvantages. Migration is the more common method because it is simpler than emulation and the risks can often be neglected. These risks include potential data loss or change, the effects of which are permanent. Emulation is complex, but it offers the better means to access preserved objects, especially complex file types comprising multiple dynamic files that must be constructed correctly. Emulation also allows users to handle digital objects with a “look and feel” as close as possible to that originally intended. 2

METADATA PROVENANCE AND VULNERABILITY | HART AND DE VRIES https://doi.org/10.6017/ital.v36i4.10146

Accurate and complete metadata are central to both migration and emulation; thus, they are the focus of this paper. Metadata are needed to record the migration history of a digital object and to record contextual information. They are also necessary to accurately render digital objects in emulated environments. Emulated environments are designed around a digital object’s dependencies, which typically include, but are not limited to, drivers, software, and hardware.
3 The metadata describe the attributes of the digital object from which we can derive the type of system in which it can run (e.g., the operating system), the versions of any software dependencies, and other criteria that are crucial for accurate creation of an emulated environment.

While metadata are being used to support the preservation of digital objects, there is another equally important role they should be playing. It is not enough to preserve the object so it can be accessed and used in the future. What of the history and provenance of the digital object? What about search and retrieval functionality within the archive or repository in which the digital object is held? One must consider how these preserved objects will be used in the future, and by whom.

Preserving digital objects is difficult if adequate metadata are not present, especially if the item is outdated and no longer supported. Looking to the future, we should try to ensure metadata are processed correctly for the lifecycle of the digital object. This means care must be taken at the time of creation and curation of any digital objects because although some metadata are typically generated automatically, many elements that will play a pivotal role later must be created manually. Digital objects also commonly go through many changes, which must be captured, as the change history will reveal what has happened to the object over its lifecycle. The changes may include how the object has been modified, migrations to different formats, and what software created or changed the object; all of these are considered when emulating an appropriate environment. Examples of these changes can be found in the case studies presented in this paper.

METADATA TYPES

The common and more widely used metadata types include, but are not restricted to, administrative, descriptive, structural, technical, transformative, and preservation metadata.
Each metadata type describes a unique set of characteristics for digital objects. Administrative metadata include information on permissions as well as how and when an object was created. Transformative metadata include logs of events that have led to changes to a digital object. 4 Structural metadata describe the internal structure of an object and any relationships between components. Technical metadata describe the digital object with attributes such as height, width, format, and other technical details. 5 Preservation metadata support digital preservation by maintaining authenticity, identity, renderability, understandability, and viability. They are not bound to any one category as they comprise multiple types of metadata, not including descriptive or contextual metadata. However, unlike the common metadata types, preservation metadata are often ambiguous. 6

In 2012, the developers of version 2.2 of the PREMIS Data Dictionary for Preservation Metadata saw descriptive metadata as less crucial for preserving digital objects; however, they did state it was important for discovery and decision making. 7 While version 2.2 allowed descriptive metadata to be handled externally through existing standards such as Dublin Core, the latest version (2017) of the dictionary allows for “Intellectual Entities” to be created within PREMIS that can capture descriptive metadata. 8 Thus, while digital preservation does not require all types of metadata, the absence of contextual metadata limits the future possibilities for the preserved object.

Hart writes that because multimedia objects are dynamic and interactive, and often composed of multiple image, audio, video, and software files, descriptive metadata are increasingly important because they can be used to describe, organise, and package the files.
9 It is also stressed that content description is of great importance because digital objects are not self-describing, which makes identifying semantic-level content difficult; without descriptive metadata, context is lost. 10 For example, without descriptive metadata to provide context, an image’s subject information and search and retrieval functionality are lost. Without this information, verifying whether an object is the original, a copy, or a fabricated or fraudulent item is impossible in most cases.

Metadata Vulnerability: Case Studies

Digital objects that are currently being created often go through several modifications, making it difficult to identify the original or authentic copy of the object. Verifying and validating authenticity is important for preserving, conserving, and archiving objects. The Digital Preservation Coalition defines authenticity as follows: “The digital material is what it purports to be. In the case of electronic records, it refers to the trustworthiness of the electronic record as a record. In the case of ‘born digital’ and digitised materials, it refers to the fact that whatever is being cited is the same as it was when it was first created unless the accompanying metadata indicates any changes. Confidence in the authenticity of digital materials over time is particularly crucial owing to the ease with which alterations can be made.” 11

Tests were undertaken to discover how vulnerable metadata can be in digital files that are subject to change, which can lead to metadata being lost, added, or modified. The tests were conducted using the file types JPEG, PDF, and DOCX (Word 2007). The tests revealed what metadata can be extracted and what metadata could be present in the selected file types. Furthermore, they revealed how specific metadata can verify and validate the authenticity of a file such as an image. For each test, the metadata were extracted using ExifTool (http://owl.phy.queensu.ca/~phil/exiftool/).
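
An extraction step of this kind can be scripted. The following is a minimal sketch in Python, assuming the exiftool command-line program is installed and on the PATH; it is not the procedure used in the tests. The function names are hypothetical. The -json option asks ExifTool for JSON output and -G prefixes each tag with its group name.

```python
import json
import shutil
import subprocess
from typing import Any, Dict


def parse_exiftool_json(raw: str) -> Dict[str, Any]:
    """Parse the JSON text that `exiftool -json` prints for a single file."""
    records = json.loads(raw)  # ExifTool emits a JSON array, one object per file
    if not records:
        raise ValueError("ExifTool returned no metadata records")
    return records[0]


def extract_metadata(path: str) -> Dict[str, Any]:
    """Run ExifTool on one file and return its metadata as a dictionary.

    Hypothetical helper: assumes the `exiftool` executable is available.
    """
    if shutil.which("exiftool") is None:
        raise RuntimeError("exiftool was not found on PATH")
    result = subprocess.run(
        ["exiftool", "-json", "-G", path],
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode output as text
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return parse_exiftool_json(result.stdout)
```

Calling extract_metadata on a file returns one dictionary per file. Counting its keys gives a rough element count, although the figure depends on the ExifTool options used and on how repeated or structured tags are counted.
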
Alternative browser-based tools were tested and provided similar results; however, ExifTool was selected as the primary testing tool because it produced the best results and had the best functionality. Some of the files tested provided extensive sets of metadata that are too large to include, but subsets can be found in Hart (2009). Note that only subsets are included because some metadata were removed for privacy and relevance reasons. Each test was conducted as follows:

• Case study 1: JPEG
  o Original metadata extracted for comparison
  o Image copied, metadata extracted from copy and examined for changes
  o File uploaded to social media, downloaded from social media, metadata extracted and examined against original
• Case study 2: JPEG (modified)
  o Original metadata extracted for comparison
  o Image opened and modified in photo editing software (Adobe Photoshop), metadata extracted from new version and examined against original
• Case study 3: PDF
  o Basic metadata extraction performed to establish what metadata are typically found in PDF files and what types of metadata could be possible
• Case study 4: DOCX
  o Original metadata extracted for comparison
  o File saved as PDF through Microsoft Word and metadata compared to original
  o File converted to PDF through Adobe Acrobat and metadata compared to original

Case Study 1

This case study investigated the everyday use of digital files, the first being simply copying a file. It was revealed that copying a file creates an exact copy of the original file with no changes in metadata aside from the creation and modification time/date. Thus, the copy could not be distinguished from the original unless the original creation time/date was known. The second everyday use was uploading an image to Facebook.
The metadata-extraction tests revealed that the original file had approximately 265 metadata elements. (The count is approximate because certain elements can be read as either single or multiple entries.) These elements included, but were not limited to, the following:

• dates
• technical metadata
• creator/author information
• color data
• image attributes
• creation-tool information
• camera data
• change
• software history

Many of the metadata elements had useful information for a range of situations. Even so, several metadata elements that would require user input for creation were missing. Once the file had been uploaded to and then downloaded from social media, approximately 203 metadata elements were lost, including date, color, creation-tool information, camera data, change, and software history. It can be argued that removing some of these metadata would help keep user information private, but certain metadata should be retained, such as change and software history. These metadata make it easier to differentiate fabricated images from authentic images and to know which modifications have been made to a file. For preservation purposes, the missing metadata are what may be needed to establish authenticity. This case study aims to make users aware of the significant risk of metadata loss when dealing with digital objects. If metadata are not identified and captured before the object is processed within a repository, the loss could be irreversible.

Case Study 2

The second case study revealed how the change and software history metadata can be used to easily identify when a file has been modified. In the test conducted, it was evident by visually comparing the images that changes were made; however, modifications are not always obvious as some changes can be subtle, such as moving an element in the image that completely changes what the image is conveying.
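
The before-and-after comparisons used in these case studies amount to comparing two sets of extracted metadata. The following is a minimal sketch in Python, assuming each set has already been loaded into a dictionary keyed by tag name. The function name, the MetadataDiff type, and the sample tags and values are hypothetical and are not taken from the test files.

```python
from typing import Any, Dict, List, NamedTuple


class MetadataDiff(NamedTuple):
    """Tag names grouped by what happened to them between two versions."""
    removed: List[str]   # present in the original only
    added: List[str]     # present in the derived file only
    changed: List[str]   # present in both, with different values


def diff_metadata(original: Dict[str, Any], derived: Dict[str, Any]) -> MetadataDiff:
    """Compare two metadata dictionaries tag by tag."""
    removed = sorted(tag for tag in original if tag not in derived)
    added = sorted(tag for tag in derived if tag not in original)
    changed = sorted(
        tag for tag in original
        if tag in derived and original[tag] != derived[tag]
    )
    return MetadataDiff(removed, added, changed)


# Illustrative values only; these are not results from the case studies.
before = {
    "EXIF:Make": "ExampleCam",
    "EXIF:ExposureTime": "1/200",
    "File:FileType": "JPEG",
    "File:FileSize": "2.1 MB",
}
after = {
    "File:FileType": "JPEG",
    "File:FileSize": "1.4 MB",
    "XMP:HistoryAction": "saved",
}
report = diff_metadata(before, after)
print(f"removed={report.removed} added={report.added} changed={report.changed}")
```

Recording the three lists each time a file is processed is one simple way to keep a trace of which elements were added, removed, or modified.
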
The following example displays the change history from the image used in case study 1, revealing how the metadata can easily identify modification:

• History Action: saved, saved, saved, saved, converted, derived, saved
• History When: the first save was at 2010:02:11 21:59:05 and the last save was at 2010:02:11 22:12:01, with each action having its own timestamp
• History Software Agent: Adobe Photoshop CS4 Windows for each action
• History Parameters: converted from TIFF to JPEG

Further testing was conducted with simple photo manipulation using an original image to see firsthand the issues described in the initial test. The image contained approximately 178 metadata elements, including the typical metadata that were found in the first case study. Once the image was processed and modified with Adobe Photoshop CS5, the metadata were no longer identical. The modified image had approximately 201 metadata elements. The new elements included Photoshop-specific data, change, and software history. However, extensive camera data were lost. It can be argued that the camera data are not important for digital preservation because their absence will not hinder the preservation process. However, once the file is preserved and those data are lost, important technical and descriptive information can never be regained.

For example, consider a spectacular digital image that captures an important moment in history. If that image is preserved for twenty years, in that time cameras and perhaps photography itself will have advanced dramatically. How digital images are captured and processed might be completely different and will most likely provide different results. Should someone wish to know how that preserved image was captured, they would need to know what camera was used, lens and shutter-speed data, lighting data, and other technical information.
Preserving those metadata can be almost as important as preserving the file itself because each metadata element has importance and meaning to someone.

As most viewers of online media are aware, photos are often modified, especially on social media. This is often performed on “selfies,” pictures taken of oneself. These can be modified to make the person in the photo look better or to hide features they see as flawed. Small modifications, such as covering some blemishes or improving the lighting, have little effect on the image’s context, but some modifications and manipulations can mislead people. These manipulated images often take the form of viral hoax images circulating around the web. For example, Figure 1 displays how two images can be combined into a composite image that changes the context of the image.

Figure 1. Composite image. “Photo Tampering throughout History,” Fourandsix Technologies, 2003, http://pth.izitru.com/2003_04_00.html.

The two images side by side are original photos taken in Basra of a British soldier gesturing to Iraqi civilians to take cover. In the right image, the Iraqi man is holding a child and seeking help from the soldier; as you can see, this soldier does not interpret this as a hostile act. The image above is a composite of the two that changes the story. In this image, the soldier appears to be responding with hostility toward the man approaching. With basic photo manipulation, this soldier who is protecting innocent civilians is portrayed as holding them against their will. Images like this circulate through media of all types, and although the exchangeable image file format (EXIF) metadata may not identify what has been done to the image, they would eliminate any doubt that the image has been modified. Unfortunately, these data are not made available.
Making users aware of this vulnerability may improve detection of file manipulation at the time of ingest to better ensure only accurate and authentic material is being considered for preservation. Donations received by digital repositories such as libraries must be scrutinised by trained individuals. With this awareness and knowledge of metadata, they can perform their duties to a much higher standard.

Case Study 3

The PDF metadata extraction provided interesting results. Over a range of tests on academic research papers, the main metadata identified consisted of PDF version, author, creator, creation date, modification date, and XMP (Adobe Extensible Metadata Platform) data. These metadata were not present in every PDF tested; in fact, the majority of PDF files seemed to be lacking important metadata. The author and creator fields were generally listed as “administrator” or “user” and bibliographic metadata were usually missing. However, PDF openly supports XMP embedding; therefore, bibliographic metadata could be embedded into the PDF.

Through further testing, bibliographic metadata linked to the PDFs were discovered stored in online databases. Bibliographic software such as EndNote and Zotero allows metadata extraction, which enables users to import PDF files and automatically generate the appropriate bibliographic metadata. For example, Zotero performs this extraction by first searching for a match for the PDF on Google Scholar. If this search does not return a match, Zotero uses the embedded Digital Object Identifier (DOI) to perform the match. This method is not consistent: it often fails to retrieve any data, and in rare cases it retrieves the wrong data, which leads to incorrect references.
Given what happens to metadata when a file is uploaded, as in case study 1, and the nature of a PDF’s journey through template selection, editing, and publishing, it is no surprise that metadata are lost or diluted along the way.

Case Study 4

The fourth case study, conducted on DOCX files, provided an extensive set of metadata, some of which are unique to this file type. Creating a new Word document via the File Explorer context menu and attempting to extract metadata resulted in an error as there were no readable metadata to extract until the file was accessed and saved. Once the file had some user input and was saved, the metadata were created and could be extracted. Microsoft Office files contain separate XML files that hold information about the document, such as formatting data, user information, edit history, and information about the document’s page count, word count, etc. A DOCX file can be pictured as a compressed directory whose contents are normally hidden from the user. However, using ExifTool on the DOCX file allowed retrieval of the metadata from all the hidden files. The metadata included creation, modification, and edit information, such as number of edits and total edit time. Every element within the document (e.g., text, images, tables) has its own metadata attached that are crucial for preserving the format of the document.

The next step in the test involved converting the DOCX file into PDF using the following two methods: (1) converting the document via the “Publish” save option within Microsoft Word; and (2) right-clicking the document and selecting the option to convert to an Adobe PDF. The results of the two methods varied slightly. Method 1 stripped all the metadata from the document and generated only default PDF metadata consisting of system metadata (file size, date, time, permissions) and the PDF version, author details, and document details. Method 2 behaved the same way except that some XMP metadata were created.
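
The document-level properties discussed in this case study can also be read from the DOCX package directly, before any conversion. The following is a minimal sketch in Python using only the standard library. It assumes the core properties sit at docProps/core.xml, which is where Word conventionally writes them. The function name is hypothetical, and this is not the ExifTool-based method used in the test.

```python
import xml.etree.ElementTree as ET
import zipfile
from typing import Dict


def read_docx_core_properties(source) -> Dict[str, str]:
    """Return the core properties stored in a DOCX package.

    `source` may be a file path or a binary file-like object.
    Assumes the conventional part name docProps/core.xml; a KeyError is
    raised if the package does not contain that part.
    """
    with zipfile.ZipFile(source) as package:
        xml_bytes = package.read("docProps/core.xml")
    root = ET.fromstring(xml_bytes)
    properties: Dict[str, str] = {}
    for element in root:
        # Tags look like "{namespace-uri}creator"; keep only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        properties[local_name] = (element.text or "").strip()
    return properties
```

Saving these properties before a DOCX file is converted keeps a record of values, such as the creator and revision number, that the converted PDF may not carry. Extended properties such as page count, word count, and total edit time are conventionally stored in a separate part, docProps/app.xml, which could be read the same way.
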
Neither method left any informative metadata, as the majority of the XMP elements were empty fields or contained generic values such as the computer name as the author. All formatting and metadata unique to Microsoft Word were lost. This case study is an enlightening example of what can happen to metadata when a file is changed from one format to another.

HUMAN INTERVENTION

The human element is a requirement in digital preservation as certain metadata, such as descriptive and administrative metadata, can only be created by humans. In fact, as Hart notes, user input is needed to record the majority of the digital preservation metadata. 12 The process can be tedious, as described by Wheatley. 13 One of the examples described included following the processes in a repository from ingest to access, beginning with the creation of metadata and the managerial tasks that are necessary. These tasks include using extraction tools and automation where possible. Using frameworks to record changes to metadata is required, and in some cases metadata must be stored externally to their digital objects. This allows multiple objects of the same type to utilise a generic set of metadata to avoid redundant data. However, although using a generic metadata set is convenient, a large collection of digital objects could be affected if the metadata are lost or damaged.

The human element increases the risk of error drastically because there are numerous steps to metadata creation. Misconduct is also possible. Therefore, the less digital preservation is reliant on humans (and the easier the tasks are that require human input), the better. This can only be achieved by automating most processes and training people to ensure they handle their responsibilities accurately, consistently, and completely.
Learning from the results of case studies like those described in this paper will better prepare users working with digital objects.

DISCUSSION

To achieve the most authentic, consistent, and complete digital preservation, institutions must revise their preservation workflows and processes. This entails ensuring the initial processes within workflows are correct before processing digital content. The content must come from a credible source and have its authenticity verified. Participation from the donor of the digital content might be beneficial if they can provide information and metadata about the content. This information could provide additional context for the content as well as identify its history (e.g., format migration or modification). This is not always possible as the donor is not always the creator of the digital content. If the original source is no longer available, as much information as possible should be gathered from the donor about the acquisition of the content and any information regarding the original source.

This should be considered and carefully monitored throughout the lifecycle of digital content. Granted, if no changes are needed, devices such as write blockers can ensure this as they prevent users and systems from making unwanted changes or “writes.” However, changes are sometimes unavoidable and (although they may not affect the content) detrimental. When changes are required, it is crucial to maintain the digital history by capturing all metadata added, removed, or modified during processing, commonly known as the “change history.”

Donor participation should be stipulated in a donor agreement, something that each institution offers to all donors, sometimes in the form of an agreement reached through communication and often as a structured document.
Donor-agreement policies differ for each institution: some are quite detailed, allowing donors to carefully stipulate their conditions, whereas others place most of the responsibility on the receiving institution. When dealing with sensitive or historic data of importance, policies should be in place to capture adequate data from the donor. When the content does not fall into this category, standard procedures, which should be present in all donor agreements and institution policies, can be followed. Institutions must also consider when to apply these steps, as some transactions between donor and institution can follow standard protocol; others are more complex, such as donations of content with diverse provenance issues.

CONCLUSION

We have presented four case studies that illustrate how vulnerable digital-object metadata are. These examples show that common methods of handling files can cause irretrievable loss of important information. We discovered significant loss of metadata when uploading photos to social media and when converting a file to another format. The digital footprint left behind from photo manipulation was also exposed. We shed light on the bibliographic-metadata generation of PDF files, how they are obtained, and the surrounding issues.

Action is needed to ensure proper metadata creation and preservation for born-digital objects. Librarians and archivists must place a greater emphasis on why digital objects are preserved as well as how and when users may need to access them. Therefore, all types of metadata must be captured to allow users from all disciplines to take advantage of historical data for many years to come. Given the rate of technological change, we must be prepared; observing first-hand the vulnerability of metadata is a step toward a safer future for our digital history.
REFERENCES

1 “Charter on the Preservation of Digital Heritage,” UNESCO, October 15, 2003, http://portal.unesco.org/en/ev.php-URL_ID=17721&URL_DO=DO_TOPIC&URL_SECTION=201.html.
2 K. Rechert et al., “bwFLA—A Functional Approach to Digital Preservation,” PIK—Praxis der Informationsverarbeitung und Kommunikation 35, no. 4 (2012): 259–67.
3 K. Rechert et al., Design and Development of an Emulation-Driven Access System for Reading Rooms, Archiving Conference, 2014, 126–31, Society for Imaging Science and Technology, 2014.
4 M. Phillips et al., The NDSA Levels of Digital Preservation: Explanation and Uses, Archiving Conference, 2013, 216–22, Society for Imaging Science and Technology, 2013.
5 “PREMIS: Preservation Metadata Maintenance Activity,” Library of Congress, accessed March 10, 2016, http://www.loc.gov/standards/premis/.
6 R. Gartner and B. Lavoie, Preservation Metadata (2nd Edition) (York, UK: Digital Preservation Coalition, 2013), 5–6.
7 PREMIS Editorial Committee, PREMIS Data Dictionary for Preservation Metadata, Version 2.2 (Washington, DC: Library of Congress, 2012), http://www.loc.gov/standards/premis/v2/premis-2-2.pdf.
8 PREMIS Editorial Committee, PREMIS Schema, Version 3.0 (Washington, DC: Library of Congress, 2015), http://www.loc.gov/standards/premis/v3/premis-3-0-final.pdf.
9 Timothy Hart, “Metadata Standard for Future Digital Preservation” (Honours thesis, Flinders University, Adelaide, Australia, 2015).
10 J. R. Smith and P. Schirling, “Metadata Standards Roundup,” IEEE MultiMedia 13, no. 2 (April-June 2006): 84–88.
11 “Glossary,” Digital Preservation Coalition, accessed August 5, 2016, http://handbook.dpconline.org/glossary.
12 Timothy Hart, “Metadata Standard for Future Digital Preservation” (Honours thesis, Flinders University, Adelaide, Australia, 2015).
13 Paul Wheatley, “Institutional Repositories in the Context of Digital Preservation,” Microform & Digitization Review 33, no. 3 (2004): 135–46.