URLs in this document have been updated. Links enclosed in {curly brackets} have been changed. If a replacement link was located, the new URL was added and the link is active; if a new site could not be identified, the broken link was removed.

Electronic Scientific Data & Literature Aggregation: A Review for Librarians

Barbara Losoff
Assistant Professor & Science Librarian
Norlin Library
University of Colorado at Boulder
Boulder, Colorado
Barbara.losoff@colorado.edu

Abstract

The advent of large-scale digital repositories, along with the need for sharing useful data worldwide, demands change to the current information structure. The merging of digital scientific data with scholarly literature has the potential to fulfill the Semantic Web design principles. This paper will identify factors leading to integration of databases and journal literature; discuss the visions of the merged format proposed by scientists; and examine librarians' role in this transformation.

Introduction

Historically, scientific databases and scholarly articles have operated as discrete units, with databases providing "background" to the journals' "foreground." As electronic access to data has become more common, the once sharp distinction between journal and database has begun to blur. This blurring has become so pronounced that the data can now be more valuable than the published paper. Data sets from the Human Genome Project "have more value than any single publication that was derived from an analysis of them" (Witt in Carlson 2008). This "merging of databases and literature indicates an approach to one of the major scientific challenges of the next decade to extend the capabilities of science literature" (Gerstein & Junker 2001). Integration of digital scientific data with scholarly literature has the potential to actualize the Semantic Web design principles envisioned by Tim Berners-Lee, director of the World Wide Web Consortium, creating a universal medium for data, information, and knowledge exchange. A fundamental feature of the Semantic Web is the deployment of a common standard, the Resource Description Framework (RDF), a web-based "Semantic Tool" for encoding knowledge, "permitting web sites to publish information as machine-readable, processable, and in integrated forms" (Tauberer 2006). The use of RDF in conjunction with the Web Ontology Language (OWL), an agreed-upon, published conceptualization of content (De Roure et al. 2003), has resulted in software systems capable of exploring relationships between data and electronic literature across multiple platforms.

A discipline-specific example of the Semantic Web in action is the SemanticEye, a software product that pairs Semantic design standards with chemistry. The SemanticEye offers "a model for rectifying electronic journal shortcomings by adapting the digital music semantic model to chemical electronic publishing" (Casher & Rzepa 2006). The goals of the SemanticEye are both the defragmentation of chemistry electronic journals and the expansion of chemistry content on the web beyond published articles (Casher & Rzepa 2006). By introducing new metadata objects, such as InChI (IUPAC International Chemical Identifier) which converts molecular structures into text, the SemanticEye provides the capability of mining electronic publications as well as the supporting data.
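As a rough illustration of why a text identifier such as InChI makes chemical content minable, the sketch below finds InChI strings in a passage of plain text and splits each one into its layers. This is not SemanticEye code. The sample passage, the function names, and the regular expression are assumptions made for this example, and the pattern is deliberately simple rather than a full InChI validator.

```python
import re

# Simplified pattern (an assumption for this sketch): the "InChI=" prefix,
# a version such as "1S", then one or more slash-separated layers.
INCHI_PATTERN = re.compile(r"InChI=1S?(?:/[A-Za-z0-9.,;()+\-*?]+)+")

# Hypothetical passage; the two identifiers are the standard InChIs
# for ethanol and water.
SAMPLE = (
    "Ethanol, InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3, is miscible with "
    "water, InChI=1S/H2O/h1H2."
)


def find_inchis(text):
    """Return InChI-like strings found in the text, in order of appearance."""
    # Trailing sentence punctuation is legal inside a layer, so strip it here.
    return [m.group(0).rstrip(".,;") for m in INCHI_PATTERN.finditer(text)]


def split_layers(inchi):
    """Split an InChI string into its version, formula, and prefixed layers."""
    body = inchi[len("InChI="):]
    version, formula, *rest = body.split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        # Each remaining layer starts with a one-letter prefix, e.g. "c" or "h".
        layers[layer[0]] = layer[1:]
    return layers


for identifier in find_inchis(SAMPLE):
    print(split_layers(identifier))
```

Because the identifier is ordinary text, the same search works on an article body, a supplementary data file, or a database export, which is the kind of cross-resource mining the paragraph above describes.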

This trend toward knowledge aggregation necessitates a critical re-examination of librarians' role in relation to data curation: "At the most fundamental level, engaging the library profession in the problem of data management may lead to reframing the values and practices of the library profession" (Brandt 2007). Librarians, in collaboration with scientists, are in a unique position to 'bridge the divide,' applying Semantic design criteria to harness the power of data and journal literature at the institutional level.

Scientific Progress and Merged Format

A review of the literature indicates that the integration of databases with journal literature and other research-related resources is an important component in furthering scientific progress. Neumann et al. (2004) expressed the need for intelligent and searchable integration: "Scientific publications and curated databases together hold a vast amount of actionable knowledge. However, their full value is realized only in the context of such resources being connected together by meaning, such that machine processes can traverse and identify these links intelligently."

The ability to mine data and derive meaning expands the possibilities for discovery, as "Scientific progress increasingly depends on pooling know-how and results: making connections between ideas, people, and data: and finding and reusing knowledge and resources generated by others in perhaps unintended ways" (Goble et al. 2006). Chris Greer, a cyberinfrastructure advisor for the National Science Foundation (NSF), echoed this potential: "People find ways to use this information (digital data) that the original researchers didn't think of" (Greer in Carlson 2006). Merging digital content and format in a manner consistent with Semantic Web criteria has the potential to eliminate disciplinary and geographic barriers. Rhoten (2007) described the ability to access data intelligently from around the globe as being as serendipitous "as the chance encounter in the hallway between researchers apparently without much in common that yields a revolutionary breakthrough" and argued that "being virtual can actually surpass being there."

Factors Driving Content Integration

In addition to the development of a common standard (RDF), the potential for data and literature content integration to realize the Semantic design principles is influenced by a number of factors: e-science, Networked Science or small science, dark data, the surge in scientific research, emerging descriptive standards, and format constraints of scientific journals.

E-science represents "scientific investigations performed through distributed global collaborations between scientists and their resources, and the computing infrastructure that enables this" (Goble et al. 2006). In e-science, information from journals or documents represents only a small percentage of the total available (Gerstein & Junker 2001). It is the Deep Web or Data Web that dominates the total amount of data on the web (Goble et al. 2006). The Human Genome Project (HGP) typifies e-science. HGP was a 13-year international effort identifying 20,000-25,000 genes in human DNA, resulting in the creation of over 500 datasets and tools (USDOE 2008; Goble et al. 2006). Data is the cornerstone for all scientific research and e-science is both a consumer and a producer of data.

The current trend in scientific research has shifted from Big Science to Networked or small science. Rhoten (2007) outlined the progression from WWII-era Big Science, to the 1990s Team Science, and lastly to Networked Science. Big Science, as described by Rhoten, was top-down, hierarchical and vertical, requiring both expensive equipment and an enormous infrastructure. Researchers were tightly managed and generally employed for life. The 1990s saw a shift to Team Science with a structure that was more bottom-up, horizontal and multidisciplinary. The 21st century ushered in Networked Science, characterized as loosely coupled and geographically open. Rhoten's view is that affiliation and even reputation are less relevant in Networked Science. Access to technology and personal motivation are the pivotal features of Networked Science. Data generated from Networked Science is predicted to be 2 to 3 times greater than that produced by Big Science (Carlson 2006).

Another emerging factor driving journal content and data integration is dark data. Defined by Thomas Goetz from Wired Magazine, dark data "doesn't yield a dramatic outcome — or, worse, the opposite of what researchers had hoped? It ends up stuffed in some lab drawer. The result is a vast body of squandered knowledge that represents a waste of resources and a drag on scientific progress" (Goetz 2007). Goetz expressed the need to 'free' dark data from failed scientific experiments. The addition of dark data into the data pipeline represents an enormous untapped information reservoir.

A global surge in scientific literature is expected as China, India, and other countries become major centers of research (Shulenberger 2005). According to Yarnell (2006), India, "backed by a rapidly growing private sector and a government that values science is poised to transform itself into a world-class science and technology powerhouse." A report issued by the NSF, Thomson Scientific, Chemical Abstracts, and the American Chemical Society in 2006 identified growth of scientific literature worldwide and a surge in papers from rapidly developing nations (Heylin 2006).

A data deluge is the expected byproduct of e-science, small science, dark data and the global surge in scientific research. IDC, a global provider of market intelligence related to information technology, published a white paper in 2007 forecasting an astronomical increase of information. IDC predicted that between 2006 and 2010 the information added annually would increase more than sixfold, from 161 to 988 exabytes (Gantz 2007). For scale, 5 exabytes is equal to the information contained in 37,000 new libraries the size of the Library of Congress (Gantz 2007).
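The scale of these figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above from Gantz (2007). The compound annual growth rate is a derived quantity that assumes smooth year-over-year growth between the two endpoints, and the function names are chosen for this example.

```python
# Figures quoted in the text from the IDC forecast (Gantz 2007).
EXABYTES_2006 = 161
EXABYTES_2010 = 988
LIBRARIES_PER_5_EXABYTES = 37_000  # Library of Congress-sized libraries


def growth_factor(start, end):
    """How many times larger the end value is than the start value."""
    return end / start


def annual_growth_rate(start, end, years):
    """Compound annual growth rate implied by the two endpoints.

    Assumes smooth compounding between the endpoints (an assumption of
    this sketch, not a figure taken from the text).
    """
    return (end / start) ** (1 / years) - 1


def library_equivalents(exabytes):
    """Convert exabytes to Library of Congress-sized libraries."""
    return exabytes / 5 * LIBRARIES_PER_5_EXABYTES


factor = growth_factor(EXABYTES_2006, EXABYTES_2010)
rate = annual_growth_rate(EXABYTES_2006, EXABYTES_2010, years=4)
libraries_2010 = library_equivalents(EXABYTES_2010)

print(f"Growth factor 2006-2010: {factor:.2f}x")
print(f"Implied annual growth:   {rate:.0%}")
print(f"Library equivalents for 2010: {libraries_2010:,.0f}")
```

Under these assumptions the forecast works out to a growth factor of a little over six, annual growth of roughly 57 percent, and about 7.3 million Library of Congress equivalents for the information added in 2010.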

Data without meaningful description is of little significance. The advancement of web-based standards for encoding knowledge is essential for creating meaning. Content description, long a forte of librarians, has evolved in the online environment with the implementation of metadata standards. Metadata are a subset of data, summarizing "data content, context, structure, interrelationships, and provenance" (NSF 2007). The Dublin Core Metadata Initiative (DCMI), crafted by information professionals, resulted in the development of interoperable, international metadata standards providing an intelligent means of describing resources (DCMI 2008). Although the DCMI proposal is a step in the right direction, currently "no single approach has been embraced by the electronic publishing domain" (Casher & Rzepa 2006). The metadata model in the forefront for encoding scientific content is RDF, which "is sufficiently expressive and can be linked to other web resources in order to be the foundation for knowledge sharing and complex scientific transactions" (Szalay & Gray 2006). A special case of RDF, OWL, provides a "systematic description of a given phenomenon" and "includes controlled vocabulary and relationships, captures nuances in meaning and enables knowledge sharing and reuse" (NSF 2007). Taking this one step further, Neumann et al. (2004) expressed a need for "information systems to include science-specific semantics."
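To make the relationship between Dublin Core and RDF more concrete, the following minimal sketch describes this article with three Dublin Core elements and writes the description as RDF statements in N-Triples syntax. It uses plain Python rather than an RDF library. The helper names are assumptions for illustration, and the choice of elements (title, creator, date) is one plausible description, not a prescribed record.

```python
# Namespace of the Dublin Core element set.
DC = "http://purl.org/dc/elements/1.1/"

# The article's DOI expressed as a resolvable URI, used as the subject.
ARTICLE = "https://doi.org/10.5062/F4HH6H0D"


def describe(subject, fields):
    """Turn a dict of Dublin Core element names into (s, p, o) triples."""
    return [(subject, DC + element, value) for element, value in fields.items()]


def to_ntriples(triples):
    """Serialize triples with plain-literal objects as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        # Escape backslashes and double quotes inside the literal.
        escaped = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)


record = describe(ARTICLE, {
    "title": ("Electronic Scientific Data & Literature Aggregation: "
              "A Review for Librarians"),
    "creator": "Barbara Losoff",
    "date": "2009",
})

print(to_ntriples(record))
```

Each output line is a subject, predicate, and object, so a program that has never seen this article can still tell which string is the title and which is the creator. That machine-readable structure is what distinguishes an RDF description from free-text cataloging notes.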

Present-day scientific journal formats are not useful in a data-centric world. Seringhaus and Gerstein (2007) declared the death knell of the published paper in its present format. The authors cited the following inadequacies of the published paper:

Useful data is not formatted for inclusion with journal publications
Some facts are considered too trivial (i.e., omission of isolated findings or negative results)
Some data sets are too large for article formats
Article formats lack the rigorous organizational structure of databases

In addition, Seringhaus and Gerstein (2006) questioned the value of the published paper as a resource: "Often the journal article, the bedrock of peer-reviewed scientific knowledge, is the last information source consulted" and "the scientific manuscript as we know it has outlived its usefulness."

Scientists' Visions of the Merged Format

The cross-disciplinary integration of data and literature, combined with intelligent retrieval and perpetual access, demands change to the current information structure. Considerations for this merged structure have been described by scientists who have called on publishers to initiate the integration.

Bourne (2005) described a transformation of the traditional journal into a data resource. This expanded functionality would allow the journal to be used more like a database. Bourne encouraged collaboration between publishers and database providers so that "If the connection is transparent to the reader, the paper has thus become a detailed entry point to the database and the database has become the detailed entry point to the literature."

Neumann et al. (2004) proposed the expansion of RDF-embedded models and semantics within online publications, thereby creating relational connections by capturing "the primary conclusions of papers in a form that can be queried by a machine, which also relates to biological entities to typed data web-objects (e.g., public genomic databases)." The authors envisioned "a system that understands and links meaning, not just matching strings of letters during a query."
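The difference between linking meaning and matching strings can be sketched with a toy set of RDF-style statements. Everything below is hypothetical: the example.org namespace, the paper, claim, and gene identifiers, and the predicate names were invented for this illustration and are not drawn from Neumann et al. A production system would more likely use an RDF store and a query language such as SPARQL.

```python
EX = "http://example.org/"  # placeholder namespace for this sketch

# Hypothetical statements linking papers to conclusions, and conclusions
# to database entities.
TRIPLES = [
    (EX + "paper1", EX + "concludes", EX + "claim1"),
    (EX + "claim1", EX + "aboutGene", EX + "geneA"),
    (EX + "geneA", EX + "recordedIn", EX + "genomeDB"),
    (EX + "paper2", EX + "concludes", EX + "claim2"),
    (EX + "claim2", EX + "aboutGene", EX + "geneB"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def papers_about(triples, gene):
    """Find papers whose recorded conclusions concern the given gene."""
    claims = {s for s, _, _ in match(triples, p=EX + "aboutGene", o=gene)}
    return sorted(
        s for s, _, o in match(triples, p=EX + "concludes") if o in claims
    )


print(papers_about(TRIPLES, EX + "geneA"))
```

The query follows two typed links (paper to conclusion, conclusion to gene) rather than searching article text for a gene name, so it would still find the paper if the article used a synonym for the gene.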

Seringhaus and Gerstein (2007) offered their vision of information architecture that would capture data in digital format, facilitating "database deposit along with manuscript publication." The authors called for indexing "all full-text journal articles, associate keywords and identifiers with database records, and linking text-books, laboratory web sites and high-level commentary." Essentially, this would be a one-stop "shopping" experience. Seringhaus and Gerstein stressed the need for scientists to "publish, share and access data on the web." In recognition of the global nature of scientific research, Seringhaus and Gerstein (2006) proposed that "the driving force behind data integration should not be a single American entity; instead, it should be a collaborative effort driven by journals: decentralized information, central access."

In Towards 2020 Science, a report published by Microsoft Research Cambridge, an international panel of experts surmised that "scientific journals will in some sense need to become databases." The panel also predicted the emergence of hybrid scientific publications "that combine the strengths of traditional journals with those of databases" (Emmott & Rison 2006). Carlson (2006), referring to the same report, noted that some journals may come to be composed primarily of raw data.

Scientists envisioned the merged format as resulting from publisher adaptations: a change from outside rather than from within academic institutions.

Librarians' Role in Merged Formats

Scientific progress increasingly relies on searchable and intelligent integration of data sets, mined in conjunction with journals and other resources. The emerging role for librarians is data curation at the local/institutional level. Curation has been defined by the Digital Curation Centre (DCC) in the UK as "maintaining and adding value to a trusted body of digital information for current and future use; specifically…the active management and appraisal of data over the life-cycle of scholarly and scientific materials" (Gold 2007; DCI 2009).

Campbell (2007) argued that "the Semantic Web initiatives of the World Wide Web Consortium offer the closest and most reasonable link between the emerging web environment and library services." Librarians, as creators of metadata, according to Campbell, are adroit in recognizing content, using controlled vocabulary, and categorizing appropriately. Witt (Witt in Carlson 2008) challenged colleges to consider librarians as data curators. According to Witt, data curation is a natural progression for librarians who have expertise in information selection, appraisal, organization, classification, preservation, and access. Lougee (Lougee in Albanese 2002) also echoed the call for librarians as participants in constructing the Semantic Web.

Librarianship skills may be applicable to data curation; however, a study from the United Kingdom by RIN-CURL (Research Information Network-Consortium of Research Libraries) posed the question "Is this a job for academic librarians?" (RIN-CURL 2007). The RIN-CURL project surveyed 2250 researchers and 330 librarians to find out how researchers interact with libraries. The survey reported that the role for academic librarians as partners in e-science remains unclear and that the picture is even 'less clear' in terms of data curation services. Existing data centers (national and international) are large, requiring staffing that exceeds that of most academic libraries, and employ a "specialist workforce with a combination of advanced subject knowledge and informatics skills" (RIN-CURL 2007). In conclusion, the study emphasized the need for further research in order to better clarify the "roles and responsibilities of all those involved in the research cycle—researchers, research institutions, and national bodies, as well as libraries—in managing the increasing volumes of digital research outputs" (RIN-CURL 2007).

An initiative identifying the role of libraries regarding the challenges of e-research and data curation from the UK is the DISC-UK (Data Information Specialists Committee) DataShare project. Group members include the Universities of Edinburgh, Oxford, and Southampton, and the London School of Economics. The group convened in 2004 in order to share data support experience and expertise with an emphasis on "bringing research libraries into the field of data curation, while supporting data management and e-research activities via open access institutional repositories and Web 2.0 technologies" (MacDonald & Uribe 2008). DataShare continues to develop best practices for repository technologies, metadata formats, and open data archiving and access.

In the United States, the Association of Research Libraries (ARL) appointed a joint task force to address e-science challenges. The report, Agenda for Developing E-Science in Research Libraries (2007), provided background information on e-science issues within the research library community, followed by recommendations addressing the challenges. Task force members expressed the potential for e-science "to be transformational within research libraries by impacting their operations, functions, and possibly even their missions" (ARL 2007). In outlining the role for academic research librarians, the task force members conveyed a sense of the unknown: "we are deeply aware that approaching these transformational issues from the constrained perspective of current condition will ultimately be insufficient" (ARL 2007). The final report listed 11 principles modeling roles for research libraries in e-science, including digital stewardship, open access, open data, and metadata creation. These principles provided a foundation for academic librarians; however, actualizing these principles will require institutional directives, support, training, and above all, communication (ARL 2007).

Although the role and work of research librarians have yet to be determined, librarians at Purdue University are testing the waters by consulting with researchers, reviewing data and creating metadata for the distributed institutional repository (i.e., data remain on faculty hard drives, departmental servers, or the TeraGrid, "a large-scale computing project run by a handful of institutions, including Purdue" (Carlson 2006)). Purdue's distributed repository, still in the early stages, affords the library community an opportunity to see data curation in action, "identifying those areas that show potential for research library engagement and collaboration" (ARL 2007).

Discussion

As evidenced in the literature, librarians and researchers have different perspectives on achieving data integration. The scientists called for publishers to adapt journal formats to e-research, while the librarians focused on data curation at the institutional level. Casher and Rzepa (2006) proposed "a 'deconstructed journal' model, aligned to the positive aspects of the web while addressing diverse stakeholders interests." Casher and Rzepa's model incorporates data and information beyond the publisher boundaries, driven by the need to "create semantic relationships between electronic articles to establish context and a community of importance to readers." The evolution of the scientific journal may eventually mirror the vision of the merged formats set forth by the scientists; however, it is more likely that the changes will come from within higher education rather than from scientific publishers.

The movement for data mining is also a movement for unrestricted access. Achieving the vision of the Semantic Web will require moving away from the restrictions and limitations of expensive and exclusive scientific publishers, for "In the current model for communicating research results, universities are the big losers and publishers are the principal winners" (Bravo & Diez 2007). In that regard, universities are creating digital institutional repositories (IRs) for faculty research. IRs provide a shared access point to published and unpublished works, including data. IRs also ensure that once data is created it will be cataloged, becoming part of the institutional memory.

Scientists and librarians need to explore their common ground, seeking change from within the institution, in order to identify and incorporate data sets to be mined along with published and unpublished literature. Scientific journal publishers may have a role in the merged formats; however, "at present it seems as if in most disciplines, leadership will fall to stewardship services run by universities and government agencies" (Lynch 2008). Data librarians from the UK, MacDonald and Uribe, echoed the call for institutional leadership: "Higher education institutions need to take some responsibility with regard to implementing effective data management systems for research data outputs" (MacDonald & Uribe 2008). Even scientific publishers conceded the point in a Nature editorial from 2008: "Universities and funding agencies need to provide and support curation facilities, tools and training" (Nature 2008).

Pollock (Pollock in Akerman 2008) offered this opinion on scientists moving away from publishers: "Much scholarly communication takes place outside the STM publishers' domain, via conferences, proceedings, data sets and so forth, none of which fit the process of the peer reviewed research article. Scientists have long (always?) been collaborative creatures—and the digital age means that scientists, and science itself, no longer need publishers to handle the distribution and sharing of information."

Librarians, in consultation with scientists, need to explore data curation and, regardless of whether librarians evolve into 'data librarians', their role in developing IRs is essential. As information curators, librarians provide the professional knowledge for ensuring that the principles guiding academic education and scholarship prevail. Librarians are also purveyors of knowledge dissemination and supporters of open data and open access publishing initiatives. In 2003 the National Academy of Sciences (NAS) issued a report outlining rules and procedures for data sharing. A distinguished panel of life science leaders authored the NAS report and also introduced an acronym to convey the new ethic: UPSIDE (Universal Principle of Sharing Integral Data Expeditiously). Tom Cech, president of the Howard Hughes Medical Institute (HHMI) and Nobel Laureate in Chemistry, chaired the panel and described the goal of the UPSIDE ethic as "a stake in the ground" guiding good behavior and tethering the community: "if they wander away, they will feel a tug" (Cech in Marshall 2003). Grant proposals to the NSF and the National Institutes of Health (NIH) increasingly require plans for archiving data (Brandt 2007). At Purdue, science faculty have "sought library science expertise in building metadata for an ontology that allows tracking of data through the workflow" as they "view the libraries as 'trusted' partners" (Brandt 2007).

The practice of librarianship, described by Gold, is "rooted in the management and 'delivery' of relationships" and that "data is an encoding of relationships in the world, whether those relationships involve instruments, physical phenomena, social entities, measurements, time, place, or other intellectual constructs" (Gold 2007). Guided by the principles of curation and dissemination, librarians offer the capacity to realize the Semantic Web by developing relationships that bridge the work of scientists with society.

References

Akerman, R. (2008, July 7). Science library pad: STM publishers imprisoned in their own walled gardens? Message posted to: http://scilib.typepad.com/science_library_pad/2008/07/stm-publishers.html.

Albanese, A. 2002. The semantic web we weave. NetConnect 28(4):125-126.

ARL. 2007. Agenda for Developing E-science in Research Libraries: Final Report and Recommendations to the Scholarly Communication Steering Committee, the Public Policies Affecting Research Libraries Steering Committee, and the Research, Teaching, and Learning Steering Committee. [Online]. Available: http://www.arl.org/bm~doc/ARL_EScience_final.pdf [Accessed March 21, 2008].

Bourne, P. 2005. Will a biological database be different from a biological journal? PLoS Computational Biology 1(3):0179-0181.

Brandt, D.S. 2007. Librarians as partners in e-research: Purdue University Libraries promote collaboration. C&RL News [Online]. Available: {http://crln.acrl.org/content/68/6/365.full.pdf+html} [Accessed September 16, 2008].

Bravo, B.R. & Diez, M.L.A. 2007. E-science and open access repositories in Spain. International Digital Library Perspectives 23(4):363-371.

Campbell, D.G. 2007. The birth of the new web: A Foucauldian reading of the semantic web. Cataloging & Classification Quarterly 43(3/4):9-20.

Carlson, S. 2006. Lost in a sea of science data. The Chronicle of Higher Education [Online]. Available: http://chronicle.com/weekly/v52/i42/42a03501.htm [Accessed March 13, 2008].

Carlson, S. 2008. How to channel the data deluge in academic research. The Chronicle of Higher Education [Online]. Available: http://chronicle.com/weekly/v54/i30/30b02401.htm [Accessed June 8, 2008].

Casher, O. & Rzepa, H.S. 2006. SemanticEye: A semantic web application to rationalize and enhance chemical electronic publishing. Journal of Chemical Information and Modeling 46:2396-2410.

De Roure, D., Jennings, N.R., & Shadbolt, N.R. 2003. The semantic grid: a future e-science infrastructure. In: Berman, F., Fox, G., and Hey, A.J.G., eds. Grid Computing-Making the Global Infrastructure a Reality; New York: Wiley, p.437-470.

DCI (Digital Curation Centre). 2009. About the DCC. [Online]. Available: http://www.dcc.ac.uk/about/ [Accessed August 8, 2008].

DCMI (Dublin Core Metadata Initiative). 1995-2009. About DCMI. [Online]. Available: http://dublincore.org/about/ [Accessed May 28, 2008].

Emmott, S. & Rison, S. 2006. Towards 2020 science. [Online]. Available: http://research.microsoft.com/towards2020science/downloads/T2020S_ReportA4.pdf [Accessed March 3, 2008].

Gantz, J.F. 2007. The Expanding Digital Universe: A Forecast of Worldwide Information Growth through 2010. [Online]. Available: {https://web.archive.org/web/20070308120733/http://www.emc.com/about/destination/digital_universe/pdf/Expanding_Digital_Universe_IDC_WhitePaper_022507.pdf} [Accessed March 27, 2008].

Gerstein, M. & Junker, J. (2001, May 7). Blurring the boundaries between scientific 'papers' and biological databases. Nature Webdebates. Message posted to: http://www.nature.com/nature/debates/e-access/Articles/gernstein.html [Accessed October 20, 2009].

Goble, C., Corcho, O., Alper, P., & De Roure, D. 2006. E-science and the semantic web: a symbiotic relationship. In: Carbonell, J.G. and Siekmann, J., editors. Discovery Science: 9th International Conference, DS 2006, Barcelona, Spain, October 7-10, 2006. Berlin: Springer (Lecture Notes in Computer Science 4265), p.1-11.

Goetz, T. 2007. Freeing the dark data of failed scientific experiments. Wired Magazine [Online]. Available: {http://archive.wired.com/science/discoveries/magazine/15-10/st_essay} [Accessed March 27, 2008].

Gold, A. 2007. Cyberinfrastructure, data, and libraries, part 1. D-Lib Magazine [Online]. Available: http://www.dlib.org/dlib/september07/gold/09gold-pt1.html [Accessed August 21, 2008].

Heylin, M. 2006. Globalization of science rolls on. C&ENews 84(48):26-5.

Lynch, C. 2008. How do your data grow? Nature 455(7209):28-29.

MacDonald, S. & Uribe, L.M. 2008. Libraries in the converging worlds of open data, e-research, and web 2.0. Online 32(2):36-41.

Marshall, E. 2003. The UPSIDE of good behavior: make your data freely available. Science 299(5609):990.

NSF (National Science Foundation). 2007. Cyberinfrastructure vision for 21st century discovery. [Online]. Available: {https://www.nsf.gov/pubs/2007/nsf0728/index.jsp} [Accessed March 15, 2008].

Nature. 2008. Community cleverness required. Nature 455(7209):1.

Neumann, E.K., Miller, E., & Wilbanks, J. 2004. What the semantic web could do for the life sciences. DDT: BIOSILICO 2(6):228-236.

Rhoten, D. 2007. The dawn of the networked science. The Chronicle of Higher Education Chronicle Review [Online]. Available: http://chronicle.com/weekly/v54/i02/02b01201.htm [Accessed March 18, 2008].

RIN-CURL. 2007. Researchers' use of academic libraries and their services. [Online]. Available: {http://www.rin.ac.uk/our-work/using-and-accessing-information-resources/researchers-use-academic-libraries-and-their-serv} [Accessed August 1, 2008].

Seringhaus, M.R. & Gerstein, M. 2006. The death of the scientific paper. The Scientist 20(9):25.

Seringhaus, M.R. & Gerstein, M.B. 2007. Publishing perishing? towards tomorrow's information architecture. BMC Bioinformatics [Online]. Available: http://www.biomedcentral.com/1471-2105/8/17 [Accessed March 18, 2008].

Shulenberger, D. 2005. Public goods and open access. New Review of Information Networking 11(1):3-10.

Szalay, A., & Gray, J. 2006. Science in an exponential world. Nature 440:413-414.

Tauberer, J. 2006. What is RDF? [Online]. Available: {https://www.xml.com/pub/a/2001/01/24/rdf.html} [Accessed May 21, 2008].

USDOE (United States Department of Energy). 2008. About the Human Genome Project. [Online]. Available: http://www.ornl.gov/sci/techresources/Human_Genome/project/about.shtml [Accessed May 22, 2008].

Yarnell, A. 2006. Indian science rising. C&ENews 84(12):12.

Issues in Science and Technology Librarianship		Fall 2009
DOI:10.5062/F4HH6H0D