URLs in this document have been updated. Links enclosed in {curly brackets} have been changed. If a replacement link was located, the new URL was added and the link is active; if a new site could not be identified, the broken link was removed.

Organizing, Contextualizing, and Storing Legacy Research Data: A Case Study of Data Management for Librarians

Brianna Marshall
bhmarsha@indiana.edu
Katherine O'Bryan
Na Qin
Rebecca Vernon

Department of Information and Library Science
School of Informatics and Computing
Indiana University
Bloomington, Indiana

Abstract

Librarians are increasingly expected to work with researchers to organize and store large amounts of data. In this case study, data management novices undertake responsibility for a legacy public health research dataset. The steps taken to understand and manage the legacy dataset are explained. As a result of the legacy dataset experience, the authors of this study identified three main issues to resolve during a data management project: file organization, contextualizing data, and storage and access platforms. Finally, recommendations are made to help librarians working with legacy data identify solutions to these problems.

Introduction

Librarians have long been considered information caretakers. In the modern world, the information being collected and preserved is often vast amounts of electronic data. To continue to fulfill this need and assist researchers with their data, libraries have begun implementing data management services. Often, librarians without formal data management training are being asked by administrators to take on data management responsibilities.

This article describes a case study in data management. A data management librarian asked the authors, a group of Indiana University Department of Information and Library Science (DILS) graduate students, to manage a legacy dataset. The dataset spanned several decades, contained 856 data files, and included myriad file formats. (In order to observe all ethical practices, we will refer to the researcher who created the data as Dr. Smith.) Based on our examination of the data, we provide recommendations for librarians who have no formal data management training and possess few data management resources.

During our data examination, we needed to agree upon terminology. The terms management, curation, preservation, and stewardship are often used interchangeably when referring to data. For the purpose of this paper, we decided to use the term management to refer to the acts of organizing, providing context, and determining storage and access for our dataset.

When we studied the current literature on data management in libraries, we found the journal articles generally informative. However, much of the information is theoretical and does not provide guidelines for librarians who do not have formal data management training. There is a need for basic guidelines these librarians can follow, including simple recommendations for organizing files, providing context to the files, and determining storage and access platforms. These are the main obstacles faced by librarians and the aspects we have chosen to focus on in this case study.

Current literature outlines many reasons for library involvement in research data management. Library science skills and subject expertise help librarians to address data management needs, including identifying electronic data issues, finding tools to address those issues, choosing proper data formats and metadata schemas, and selecting data transfer and storage protocols in and across subject areas. Librarians regularly collaborate across disciplines, which is important since communication with technology and subject experts is common in data management projects (Bardyn et al. 2012; Ferguson 2012; Garritano & Carlson 2009; Latham 2012). Furthermore, libraries should supervise data management because they have the capacity to manage diverse data types. A large volume of the researchers' data is not properly managed; the researchers need libraries' guidance. Ideally, data management activities will take place through observation and participation within the data's researcher communities from the beginning of a project (Heidorn 2011).

Librarian training opportunities for data management exist and include professional development in the form of workshops and webinars; continuing education certificates offered by library schools; and self-directed training through many freely available online resources. (For examples of these resources, please see Library Training for Data Management and General Best Practices/Records Management in the Appendix.) Formalized education in data management is just beginning to be added to Master's-level library science programs. Because it is not yet widely taught, the lack of new graduates with data management training can be problematic for libraries as they seek someone with this skill set.

Case Study

Data Description

At the beginning of this project, we received a dataset created during the 1970s, 1980s, and early 1990s by Dr. Smith, who gathered and worked with public health data. The body of work included:

Two descriptive documents,
15 folders, each containing between two and 495 files,
and 856 data files total.

Files were originally created and stored on mainframe computers and Unix systems, many needing the common statistical software SPSS to run. The two descriptive documents included in the files attempt to provide enough information for other users to navigate the dataset. One document is a Microsoft Word file created by Dr. Smith to explain the names of the folders. The second document is a ReadMe text file with information gathered during the data management librarian's interview with Dr. Smith. This file gives context and explains naming conventions and related documentation, including the survey instruments used to gather the data and articles published on the data.

In general, three major file types (data, input, and output) comprise the dataset. The data files contain coded responses to questionnaires designed by Dr. Smith, which appear to the viewer only as long series of numbers with no contextual information. Input files are a series of commands that tell the software how to mine the data. The input files then generate output files. Output and print out files have tables and charts used by Dr. Smith to present data in published articles. Copy files can be any one of the three major types of files. The file extensions found within the dataset can be split between the major file types as follows:

Data: .dat, .txt, .0000, .0001, .ask, .def, .mick, .fix, some no extensions
Input: .sps, .george, .orig, .fix, some no extensions
Output/Print Out: .out, .inf, .put, .tab, some no extensions
Copies: .old, .save, .fixed, .back
Plain Text/Other: .info, .com, .jnl, some no extensions

We ran into several problems when first assessing the dataset, including:

Software compatibility: The files can be opened and reviewed with the help of programs like TextEdit or Notepad. However, the files were created and used during the 1970s, 1980s, and early 1990s. Dr. Smith created these files with outdated versions of SPSS in a Unix computing environment. Newer computers, which run more recent versions of SPSS, may have difficulty working with the files.

General lack of knowledge of old software and systems: We had only a basic working knowledge of the Unix operating system and no knowledge of SPSS. Dr. Smith's knowledge when creating these files was much more extensive. This made the process of identifying types and purposes of files more difficult.

Understanding context: Data files contain series of numbers with no explanation about what these numbers mean. Without the original questionnaires or the background knowledge of the study, data is not understandable and therefore unusable.

Naming complexity: Decoding the naming conventions for files would take time and access to other documents or an additional interview with Dr. Smith. Letters in the file name can potentially refer to locations, classes, collaborating researchers, topics, when the study was performed, the type of software program, and other specific uses for files. Numbers prior to the file extension are intentional; often referring to the year(s) data was gathered, to a class name, or a combination of the two. Without knowing where, when, why, or with whom Dr. Smith worked, and who completed the questionnaires, it will be difficult to ascertain exactly why files have certain names.

Difficulty understanding language within files: The language in input files is supposed to be understood by the software. Dr. Smith, who created the information within the files, had extensive knowledge of the language. Without detailed research into not only SPSS in general but also the many older versions of the software, we would not be able to understand what files are meant to do. Dr. Smith's own language, which includes abbreviations for data categories in tables, is difficult to understand without a provided glossary.

Identifying Desiderata: Not knowing what the language means within the files also makes it difficult to discover what files are excess. Some files may be corrupted, contain uncorrected errors, be incomplete, be blank, or not have file extensions. Decisions as to what should be saved can only be made with knowledge of the language or the software.

After looking through the data and encountering problems, we needed to do more research about the software programs and files. While we learned a lot about Dr. Smith's research by examining the dataset and consulting her web site, we needed more assistance. We were lucky enough to have a well-established IT department on-campus, including a specialized statistics and math IT center. The staff members at this center were able to provide solutions to many of our identified problems, as well as potential future problems. The questions we asked included:

Exactly how do the various software programs work?
What would the software costs be for future researchers or the data management program?
Are there any helpful resources explaining the software language or providing online tech support?
Do they have any specific recommendations about future file formats and the best software options?
Would we run into any compatibility issues between the old files and the new software?

With these considerations in mind, we focused our efforts on three baseline practices: organizing the data, providing context for the data (if possible), and identifying storage and access options for the future.

Organization & Providing Context

Dr. Smith's articles, especially the sections about methodology, and supporting documents can provide context for understanding the dataset's collection and analysis. The questionnaires and articles that accompanied our dataset informed the creation of basic metadata. Using a metadata standard such as Dublin Core or METS is advisable since these are widely used and compliant with many digital repositories. The Digital Curation Centre provides additional guidance in selecting a metadata schema. (You can find more information in our Appendix under Organization & Providing Context.) There are several possible formats for storing metadata, but CSV or XML are preferable due to their extensibility.

After we developed an understanding of our data, we asked ourselves whether we should keep all the data--and if we could not keep the data, how could we decide which data to keep? We considered several factors in answering these questions, focusing on who would be using and accessing this data in the future. Because we wanted to open the dataset to all users, we decided to keep the complete dataset. While cost is an important issue, institutions should proceed cautiously when deciding whether or not to remove data. Beyond wanting to open our data to all users, we chose to err on the side of caution and keep our dataset intact because we did not feel that we had enough information to delete files from the dataset.

We originally utilized the information contained within the dataset's master ReadMe file to try creating a new naming convention for the folders and files. Given the brief time frame of our project, we selected a sample set of our data to organize.

All files within a folder have unique relationships. Data files are only understandable when linked to questionnaires. SPSS files are connected to individual data files. Output files are the direct result of a specific SPSS file. We decided including this contextual information in our file naming convention would be helpful. Future researchers would be able to quickly and easily find all the files connected with a specific document or study. We wanted to ensure that if the files were accessed individually without the context that the folder provided, researchers would still have the best possible understanding of the file. In its current state, if files were taken from the folder there would be almost no way to tell where they originated. Including the folder name in the file name would ensure this situation could be avoided.

As a result, we decided to add the folder name to each file name, use the original file name, and the source documents associated with each file. This was our final file naming convention:

folderName_originalName_sourceName_sourceFormat [questionnaire or data file]

Examples:	IndianaUniversityStudy2001_slisclass_PHQ_questionnaire
	IndianaUniversityStudy2001_sliswrit_slisclass_data

While we were pleased with our final naming convention, we realized our attempt to provide context created a major issue. All SPSS files link back to another data file. To do this, the syntax within the files actually states the full name of the data files. Changing the names of data files would make it impossible for SPSS files to run properly because the connection between the two would be lost. Modifying the real data file name would not affect the name listed in the SPSS file. Ultimately, this problem was too significant to ignore or explain in a ReadMe file. As a result, we had to abandon the idea of renaming files and consider our other option.

Instead of renaming files, it is just as effective to provide context with a more extensive explanatory document. We decided to use the Data Curation Profiles (DCP) Toolkit, a freely available tool developed by the Purdue University Libraries and the Graduate School of Library and Information Science at the University of Illinois Urbana-Champaign, to help us structure a follow-up interview with Dr. Smith. DCPs are documents that are intended to be completed jointly by a data manager and researcher. They contain 13 modules, several of which can be modified to fit the particular dataset, which act as prompts for the purpose of collecting additional metadata and contextual information about the dataset.

We felt that a DCP could capture the context of this data in a more in-depth way than the available ReadMe file (see Appendix). While it would require a follow-up interview, it solved many of the problems we faced. We appreciated the easy-to-follow format of the profiles, as well as the fact that we could use a tool that was already available and being used widely by other institutions.

Storage & Access

Next we needed to determine how to securely store our dataset so that it could be accessed by all researchers and interested parties in the future. The main consideration with storing the dataset is whether it is secure and backed up. We also chose to prioritize storage options that provide access options for users, not just preservation. Our university has an institutional repository, so this would be the option we would select for our dataset. However, we wanted to consider the question of storage and access hypothetically to provide ideas for other librarians.

An institutional repository would be the best place to securely store, preserve, and provide access to the dataset. Another option that provides storage, preservation, and accessibility would be a domain repository, though some have costs associated with use. The indexes DataBib and Re3data allow users to search and find worldwide domain repositories. (You can find the URLs for these indexes in the Storage & Access section of the Appendix.)

If an institutional or domain repository is not an option, storing the data on institutional server space is another possibility. This is preferable to storing data on cloud storage, an external hard drive, or another comparable local storage solution. Ideally these methods would only be used to back up the dataset, since they provide only preservation without access options for users.

Regardless of the initial vehicle for storing your data, the data should be backed up following the basic guidelines provided by the MIT Data Management Subject Guide. (You can find this URL in the Storage & Access section of the Appendix.)

After determining a storage platform that handles data preservation, an important consideration is to determine how users will be able to access the data. We were working with public health data, which introduced special considerations. While there is no national standard for managing sensitive data, public health data may involve private health information. It is important to keep human subject implications in mind prior to making any data available. The researcher who created the data would be best able to determine whether there are any human subject implications, but if they are deceased or otherwise unavailable, consult your institution's research ethics committee or Institutional Review Board (IRB).

If there are not human subject implications and the researcher is willing and able to make the data freely available, the library should make it a priority to open the data to all users. Assigning an open access license of the researcher's choice should not be a problem in institutional or domain repositories; be sure that your repository allows the researcher to retain control over licensing the data. If you do not have access to a repository, you can recommend the researcher use a third-party service that provides access and preservation, such as Figshare. (For more information about Figshare, please see our Appendix under Storage & Access.)

Recommendations

Based upon our experience managing this dataset, we suggest that others follow the guidelines listed.

Education: To get a feel for the data management landscape, take advantage of the many freely available guides and workshops on data management best practices. MIT's Data Management Guide is very accessible for individuals new to data management.

Documentation: As a standard for managing legacy data, fill out a Data Curation Profile with the researcher responsible for the dataset (if possible). Gather as much contextual information about the dataset as possible. The profile will not only help you and your colleagues in making decisions about the data, it will also enable future users to understand the data.

Leverage Existing Resources: Utilize your institution's IT department. You will likely need to consult with them to determine storage options or to understand file formats better. This can be a great resource and we encourage you to reach out to them.

Licensing and Access: Make the dataset open access (if possible). If the data is not freely accessible, what is the point of preserving it?

Sustainable Policies: Based upon the experience you have managing the data, it may be useful to develop a simple policy that indicates what data your institution will accept and in what form. This can represent one of the first steps toward developing a data management infrastructure at your institution.

Conclusions

While data management can be a complex and overwhelming subject, there are simple steps that librarians can take to manage a dataset similar to ours. By sharing our experience with organizing, giving context, and determining storage and access options for our dataset, we have created general guidelines for managing legacy public health research data. Additional case studies of this nature would be beneficial to librarians as the profession continues to define its role in data management work. These case studies would provide even more help for the librarian with data management responsibilities, a role that will increase in importance as data continues to be generated in larger quantities. In addition, these case studies could inform data management training in the library school curriculum, defining which skills are the most important for librarians with data responsibilities.

Acknowledgements

The authors would like to thank Brian Winterman and Stacy Konkiel for their support and guidance during the writing of this paper.

References

Bardyn, T.P., Resnick, T., & Camina, S.K. 2012. Translational researchers' perceptions of data management practices and data curation needs: findings from a focus group in an academic health sciences library. Journal Of Web Librarianship, 6(4), 274-287.

Ferguson, J. 2012. Description and annotation of biomedical data sets. Journal of eScience Librarianship 1(1), 51-56.

Garritano, J.R. & Carlson, J.R. 2009. A subject librarian's guide to collaborating on e-science projects. Issues in Science and Technology Librarianship [Internet]. [Cited 2013 June 11]; 57. Available from http://www.istl.org/09-spring/refereed2.html

Heidorn, P.B. 2011. The emerging role of libraries in data curation and e-science. Journal of Library Administration 51(7-8), 662-672.

Latham, B. & Poe, J. 2012. The library as partner in university data curation: A case study in collaboration. Journal Of Web Librarianship, 6(4), 288-304.

Appendix

Library Training for Data Management

Professional development – workshops, webinars
Continuing education
- University of Arizona – DigIn Graduate Certificate
- University of Illinois-Urbana Champaign – Data Curation Specialization
- University of North Carolina – Data Curation Post-Master's Certificate
DIY Training Kit for Librarians - http://datalib.edina.ac.uk/mantra/libtraining.html
University of Minnesota Data Management Course Modules - https://sites.google.com/a/umn.edu/data-management-course_structures

General Best Practices/Records Management

Primer on Data Management: What You Always Wanted to Know - http://www.dataone.org/sites/all/documents/DataONE_BP_Primer_020212.pdf
Data Management and Publishing – MIT Subject Guide - {http://libraries.mit.edu/data-management/}

Available Tools

DMPTool - https://dmp.cdlib.org/. This tool helps researchers and librarians create a data management plan.
DataUP - http://dataup.cdlib.org/. This is billed as a tool that "facilitates documenting, managing, and archiving tabular data."
Data Curation Profiles Toolkit - http://datacurationprofiles.org/. This toolkit helps guide discussions between librarians and faculty with the goal of planning data support services for faculty needs.

Organization & Providing Context

Digital Curation Centre - Using Metadata Standards - http://www.dcc.ac.uk/resources/briefing-papers/standards-watch-papers/using-metadata-standards/
MIT Data Management Subject Guide - Organizing Your Files {http://libraries.mit.edu/data-management/store/organize/}

Storage & Access

Indexes for locating domain repositories worldwide:
- DataBib http://www.databib.org
- Re3data http://www.re3data.org/
Figshare - http://www.figshare.org/. A third-party service that provides data access and preservation.
MIT Data Mangement Subject Guide - Backups and Security - {http://libraries.mit.edu/data-management/store/backups/}

Previous	Contents		Next
Issues in Science and Technology Librarianship		Fall 2013
DOI:10.5062/F4K07270