Issues in Science and Technology Librarianship | Spring 2015
DOI:10.5062/F42805MM
URLs in this document have been updated. Links enclosed in {curly brackets} have been changed. If a replacement link was located, the new URL was added and the link is active; if a new site could not be identified, the broken link was removed. |
Chung-Yi Hou
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
Champaign, Illinois
hou@illinois.edu
With the proliferation of digital technologies, scientists are exploring various methods for the integration of data to produce scientific discoveries. To maximize the potential of data for science advancement, proper stewardship must be provided to ensure data integrity and usability in both the short and the long term. In order to assist scientists and their library and information partners to gain familiarity, skills, and knowledge of data management, the Federation of Earth Science Information Partners (ESIP) has developed the Data Management for Scientists Short Course. The Short Course provides training on a range of data management topics, including metadata, file formats, and interoperable content. This paper provides an overview of the background and motivation for data management, followed by an introduction to the Short Course. The discussion of the Short Course's features demonstrates the advantages of learning the fundamentals of data management through the Short Course. The paper concludes by emphasizing the importance of familiarity with data management practices for both researchers and their library and information partners.
The improvement of digital technologies initiated the era of eScience, denoted by "the use of digital technology for solving scientific problems" (Greenberg et al. 2009). The resulting data proliferation is manifested in the number of data sources available for researchers to use, as well as in the diversity of data types and formats. The EMC Digital Universe study (Gantz & Reinsel 2012) estimates that, between 2005 and 2020, the quantity of digital data created, replicated, and consumed will grow by a factor of 300, increasing from 130 exabytes (one exabyte is 10^18 bytes) to 40,000 exabytes. The constantly expanding volume of available data allows researchers to conceive of and participate in new research opportunities. For example, in climate science, the act of combining different datasets can produce high-resolution climate model datasets with improved details (Rife et al. 2014). In the biodiversity field, data integration likewise facilitates the investigation and visualization of human-environmental interactions and the resulting impacts (Del Rio et al. 2013; Spretke et al. 2011).
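The growth figures quoted above from the EMC Digital Universe study can be checked with a few lines of arithmetic. The sketch below is illustrative only: the start and end volumes are taken from the text, while the constant compound growth rate is an assumption introduced here to put the projection on a yearly scale, not a figure reported by the study.

```python
# Figures quoted in the text from Gantz & Reinsel (2012).
START_EXABYTES = 130      # estimated digital data volume in 2005
END_EXABYTES = 40_000     # projected digital data volume in 2020
YEARS = 2020 - 2005


def growth_factor(start: float, end: float) -> float:
    """Return how many times larger the end volume is than the start volume."""
    return end / start


def implied_annual_rate(start: float, end: float, years: int) -> float:
    """Return the yearly growth rate, ASSUMING constant compound growth.

    This assumption is ours, made for illustration; the study itself
    is only cited here for the start and end volumes.
    """
    return (end / start) ** (1 / years) - 1


factor = growth_factor(START_EXABYTES, END_EXABYTES)
rate = implied_annual_rate(START_EXABYTES, END_EXABYTES, YEARS)

print(f"Growth factor 2005-2020: about {factor:.0f}x")
print(f"Implied annual growth (constant-rate assumption): {rate:.1%}")
```

The computed factor is slightly above the rounded "factor of 300" given in the text, which is consistent with the study's figure being an approximation.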
With the increase in digital data availability and the need to use these resources for scientific research, the expectation of easy and effective access is rising as well. The concept of open access to data, which encourages the dissemination and use of information by a wide community in a cost-effective manner, has been promoted within the U.S. government for over two decades. The concept stems from Circular No. A-130: Management of Federal Information Resources (Office of Management and Budget (US) 2000), which stipulates that, with few exceptions (e.g., national security and privacy), government data should be openly available at no more than the cost of dissemination. In addition, data created through public funding are considered public goods (Jaeger et al. 2011). As a result, the concept of open access to information is broadly recognized by library and information science (LIS) organizations (Case & Matz 2003). Increasingly, as society benefits economically from new scientific developments (High Level Expert Group on Scientific Data 2010), we have gained more appreciation for data as national assets.
In 2013, these issues became more visible through the Open Data, Open Science, and Open Government initiatives of the White House Office of Science and Technology Policy (OSTP) ({https://web.archive.org/web/20161018042203/https://www.whitehouse.gov/administration/eop/ostp/initiatives}). Specifically, the "Open Data" element of the initiative aims to promote innovation and economic growth through public access and use of federal data. Equally important, the "Open Government" element intends to promote accountability for government actions through emphasizing data transparency, reproducibility, and collaboration.
Access to large volumes of available and usable data, however, satisfies only part of the criteria required for scientific research. Many additional steps are needed to fully realize the value of scientific data. Establishing data management practices is particularly crucial, as the sheer quantity of data created and required by a researcher can be overwhelming. In addition, many research topics require diverse data types and formats from multiple sources and disciplines, as well as long-term data for assessment. It is imperative that, through data management practices, data be structured and organized clearly and systematically, on both the technical and the content level, so that data can be accessed, understood, and used without causing researchers to develop "information pathologies," such as information overload, information anxiety, and even information avoidance (Bawden & Robinson 2009).
To support Open Science, several federal agencies and research organizations, such as the National Science Foundation (NSF), National Oceanic and Atmospheric Administration (NOAA), and U.S. Geological Survey (USGS), have begun to promote data sharing policies. Many of these organizations have also started to develop data management requirements and guidelines targeted at their research communities. Consequently, to help researchers, who generally have received little or no data management training, multiple organizations are developing training and educational resources relating to data management. All of these resources are designed to enhance researchers' data management knowledge base and skills, in addition to promoting information sharing with broader communities. Examples of these resources include:
One might consider it fairly straightforward to select and study the appropriate resources and to take data management actions accordingly. However, researchers often lack the time and familiarity required to identify relevant materials that meet their data management needs. Researchers also frequently cannot fulfill all data management tasks on their own, such as writing and completing data management plans. This is a significant concern, as data management tasks should be undertaken during the scientific process rather than only at the end of the project. As a result, researchers often need to partner with LIS professionals, such as data managers and librarians, to assist with data management tasks and to facilitate ongoing collaborations.
Although researchers, data managers, and librarians are all important stakeholders in data management, they have different levels of understanding and interest. Condensing the diverse and distributed educational resources into a single, available digest permits researchers and, often more importantly, the LIS professionals who assist them to obtain a common, up-to-date knowledge of data management, as well as the related skills, without becoming intimidated or confused during the process.
In order to provide centralized locations where data management resources are organized and presented for effective education and training purposes, many libraries, information organizations, and federal agencies, such as Data Information Literacy, DataONE, and USGS, have recently created data management training modules. The modules from these different organizations and agencies are typically geared towards specific audiences and user needs. Acknowledging the need for concise, time-efficient data management resources applicable to diverse users, the Federation of Earth Science Information Partners (ESIP) developed its own training modules to promote a broad range of basic data management knowledge and skills.
ESIP is a "broad-based community drawn from agencies and individuals who collectively provide end-to-end handling for Earth and environmental science data and information" (ESIP Federation 2011) in order to support data needs of both the ESIP members and the wider Earth science community. By facilitating collaborations among agencies and individuals with broad knowledge, skill sets, and experience levels, ESIP's vision is to "be a leader in promoting the collection, stewardship and use of Earth Science data, information and knowledge that is responsive to societal needs" (ESIP Federation 2011).
Understanding the need for basic yet broad-perspective training resources for Earth science data management, ESIP partnered with NOAA and the Data Conservancy to develop the "Data Management Short Course for Scientists" (hereafter called "Short Course"). From 2011 to 2013, members of the ESIP Data Stewardship Committee produced 35 course modules. Twelve individuals, representing federal agencies, academic institutions, information organizations, and data centers, contributed as module authors. The Short Course's modules were also peer-reviewed and edited by scientific researchers, data stewards, and library professionals. In addition, the Short Course has been presented to audiences at multiple national and international venues to elicit feedback. Detailed reviews of the course's usage and content are planned, so that the ESIP Data Stewardship Committee can continue to maintain and update the modules to benefit researchers and data/information professionals, including librarians.
The Short Course can be accessed free of charge through two methods: 1) ESIP Commons (http://commons.esipfed.org/datamanagementshortcourse) and 2) ESIPFED Vimeo (http://vimeo.com/album/2142831).
From the ESIP Commons page, the 35 modules from the Short Course are organized into the following major sections:
The full listing of the modules in alphabetical order by module name can be found in Appendix A.
Within ESIP Commons, each module has a separate landing page. The landing page contains an overview of the module, its digital object identifier (DOI), a link for downloading the presentation slides, and the ability to stream a video of the training module directly from the page. The video associated with each module lasts about 5 to 15 minutes. Although the modules were developed as part of a complete syllabus, each represents a stand-alone topic. In addition, the modules are designed to be self-paced and flexible, so that users can review and stream or download specific modules as needed. This enables the Short Course to be customized to each individual's data management knowledge, skill level, and time constraints. As the Short Course's materials are openly available under a Creative Commons attribution license, they can also be repurposed with proper attribution.
The Short Course presents a wide range of topics pertaining to specific data management tasks, such as "Agency Requirements," "Creating Documentation and Metadata," and "Elements of a Data Management Plan." Related topics, including "Enhancing Your Reputation," "Using Self-Describing Data Formats," and "Copyright and Data," highlight the importance of the social, technical, and legal aspects of data management. As such, the Short Course also assists its users in developing a deeper perspective of the full sphere of data management. Lastly, regular updates are planned with the support of the ESIP Data Stewardship Committee, allowing users to return to the modules periodically to revisit particular topics and locate new modules. Figures 1 and 2 show samples of key contents presented by the "Elements of a Data Management Plan" and "Enhance Your Reputation" modules (last accessed 21 May 2015).
Figure 1. Sample of Key Content Presented in the "Elements of a Data Management Plan" Module
Figure 2. Sample of Key Content Presented in the "Enhance Your Reputation" Module
Although the Short Course was designed for scientists, the modules are intended to support both scientific researchers and the data/information professionals who assist and collaborate with them. As data management practices and expectations continue to evolve, LIS professionals will continue to be important partners for researchers. In the academic environment, data management is a growing challenge. Librarians who are already involved and experienced with knowledge management can provide vital data support for researchers. The Short Course can thus serve librarians in two ways: 1) for their own education, and 2) as resources to recommend to their graduate students, faculty, and staff.
The rapid increase in the availability and diversity of data will likely continue as digital technologies improve. Although the "data deluge" will persist, value can be derived from data through proper and continuous data management. To facilitate better data sharing and reuse among the wider community, including academic, commercial, and public sectors, it is imperative for scientific researchers to familiarize themselves with data management requirements and to acquire the necessary skills. Concurrently, it is also crucial to emphasize that LIS professionals are important data management partners. Together, researchers and librarians can exchange knowledge and share resources in order to grow and foster data management skills and expertise. While various data management training modules are available, ESIP's Data Management Short Course for Scientists fills a niche by drawing the attention of scientific researchers and by maintaining a syllabus with comprehensive and fundamental training topics. The modules are designed to be short and focused to enable a personalized training sequence. The modules are equally applicable as training and reference resources for librarians. When used by both researchers and LIS professionals, ESIP's Short Course can help establish common ground and shared knowledge for communicating and collaborating on data management tasks. In turn, as experience and best practices in data management continue to mature, data management training will allow researchers and librarians to help fulfill data's full potential through growing scientific knowledge.
The author gratefully acknowledges the help and support of Ruth Duerr (National Snow and Ice Data Center), Nancy Hoebelheinrich (Knowledge Motifs, LLC), Justin Goldstein (U.S. Global Change Research Program and the University Corporation for Atmospheric Research), Matthew Mayernik (National Center for Atmospheric Research), and the ESIP Data Stewardship Committee.
Bawden, D. & Robinson, L. 2009. The dark side of information: Overload, anxiety and other paradoxes and pathologies. Journal of Information Science [Internet]. [Cited 2015 January 26]; 35(2):180-191. DOI: 10.1177/0165551508095781
Case, M.M. & Matz, J. 2003. Framing the issue: Open access. ARL: A bimonthly report on research library issues and actions from ARL, CNI, and SPARC, 226:8-11.
Costello, M.J. & Wieczorek, J. 2013. Best practice for biodiversity data management and publication. Biological Conservation [Internet]. [Cited 2015 January 26]; 173(2014):68-73. DOI: 10.1016/j.biocon.2013.10.018
Del Rio, N., Villanueva-Rosales, N., Pennington, D., Benedict, K., Stewart, A., & Grady, C.J. 2013. ELSEWeb meets SADI: Supporting data-to-model integration for biodiversity forecasting. Discovery Informatics: AI Takes a Science-Centered View on Big Data, AAAI Technical Report FS-13-01:8-15.
ESIP Federation. 2011. [Internet]. Vision; [Cited 2015 January 26]. Available from: http://www.esipfed.org/history
Gantz, J. & Reinsel, D. [Internet]. [2012]. The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the Far East; [Cited 2015 January 26]. Available from: http://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf
Greenberg, J., White, H., Carrier, C. & Scherle, R. 2009. A metadata best practice for a scientific data repository. Journal of Library Metadata [Internet]. [Cited 2015 January 26]; 9(3-4):194-212. DOI: 10.1080/19386380903405090
Halbert, M., Moen, W., & Keralis, S. 2012. The DataRes research project on data management. Proceedings of the 2012 iConference; 2012 February 7-10; Toronto. New York (NY): ACM. p. 589-591. DOI: 10.1145/2132176.2132300
High Level Expert Group on Scientific Data. [Internet]. [2010]. Riding the wave – How Europe can gain from the rising tide of scientific data – Final report of the High Level Expert Group on scientific data – A submission to the European Commission; [Cited 2015 January 26]. Available from: http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf
Jaeger, P., Bertot, J., Kodama, C., Katz, S., & DeCoster, E. 2011. Describing and measuring the value of public libraries: The growth of the Internet and the evolution of library value. First Monday [Internet]. [Cited 2015 January 26]; 16(11). DOI: 10.5210/fm.v16i11.3765
Office of Management and Budget (US). 2000. Circular No. A-130: Management of Federal Information Resources. Memorandum for heads of executive departments and agencies [Internet]. Washington (DC): Office of Management and Budget (US); [cited 2015 January 26]. Available from {v}
Office of Science and Technology Policy (US). [date unknown]. OSTP Initiatives [Internet]. Washington (DC): Office of Science and Technology Policy (US); [cited 2015 January 26]. Available from: {https://web.archive.org/web/20161018042203/https://www.whitehouse.gov/administration/eop/ostp/initiatives}
Rife, D.L., Pinto, J.O., Monaghan, A.J., Davis, C.A., & Hannan, J.R. [Internet]. [updated 2014]. NCAR Global Climate Four-Dimensional Data Assimilation (CFDDA) hourly 40 km reanalysis; [Cited 2015 January 26]. DOI: 10.5065/D6M32STK
Spretke, D., Janetzko, H., Mansmann, F., Bak, P., Kranstauber, B., & Davidson, S. 2011. Exploration through enrichment: A visual analytics approach for animal movement. Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems; 2011 November 1-4; Chicago. New York (NY): ACM. p. 421-424. DOI: 10.1145/2093973.2094038
(The list is also available at: http://commons.esipfed.org/datamanagementshortcourse)
This work is licensed under a Creative Commons Attribution 4.0 International License.